NAME

Bio::ToolBox::Data::Stream - Read, Write, and Manipulate Data File Line by Line

SYNOPSIS

  use Bio::ToolBox::Data;
  
  ### Open a pre-existing file
  my $Stream = Bio::ToolBox::Data->new(
        in      => 'regions.bed',
        stream  => 1,
  );
  
  # or directly
  my $Stream = Bio::ToolBox::Data::Stream->new(
        in      => 'regions.bed',
  );
  
  ### Open a new file for writing
  my $Stream = Bio::ToolBox::Data::Stream->new(
        out     => 'output.txt',
        columns => [qw(chromosome start stop name)],
  );
  
  
  ### Working line by line
  while (my $line = $Stream->next_line) {
          # get the positional information from the file data
          # assuming that the input file had these identifiable columns
          # each line is a Bio::ToolBox::Data::Feature object
          my $seq_id = $line->seq_id;
          my $start  = $line->start;
          my $stop   = $line->end;
          
          # change values
          $line->value(1, 100); # index, new value
  }
  
  
  ### Working with two file streams
  my $inStream = Bio::ToolBox::Data::Stream->new(
        in      => 'regions.bed',
  );
  my $outStream = $inStream->duplicate('regions_ext100.bed');
  my $sc = $inStream->start_column;
  my $ec = $inStream->end_column;
  while (my $line = $inStream->next_line) {
      # adjust positions by 100 bp
      my $s = $line->start;
      my $e = $line->end;
      $line->value($sc, $s - 100);
      $line->value($ec, $e + 100);
      $outStream->write_row($line);
  }
  
  
  ### Finishing
  # close your file handles when you are done
  $Stream->close_fh;

DESCRIPTION

This module works similarly to the Bio::ToolBox::Data object, except that rows are read from a file handle rather than from a memory structure. This allows very large files to be read, manipulated, and even written without slurping the entire contents into memory.

For an introduction to the Bio::ToolBox::Data object and methods, refer to its documentation and the Bio::ToolBox::Data::Feature documentation.

Typically, manipulations are performed on only one row at a time, not on an entire table. Therefore, large-scale table manipulations, such as sorting, are not possible.

A typical workflow consists of opening two Stream objects, one for reading and one for writing. Rows are read, one at a time, from the read Stream, manipulated as necessary, and then written to the write Stream. Each row is passed as a Bio::ToolBox::Data::Feature object. It can be manipulated as such, or the corresponding values may be dumped as an array. Working with the row data as an array is required when adding or deleting columns, since these manipulations are not allowed with a Feature object. The write Stream can then be passed either the Feature object or the array of values to be written.
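
The shape of this workflow can be illustrated without the module itself. The following is a minimal sketch in core Perl, not the module's implementation: in-memory file handles stand in for the read and write Streams, and the sample rows and the appended length value are made up for illustration. Only one row is held at a time, and the row is handled as an array because a column is being added.

```perl
use strict;
use warnings;

# Sample tab-delimited input, standing in for a large file on disk
my $input  = "chr1\t100\t200\tgeneA\nchr2\t500\t900\tgeneB\n";
my $output = q();

# In-memory handles stand in for the read and write Streams
open( my $in_fh,  '<', \$input )  or die "cannot open input: $!";
open( my $out_fh, '>', \$output ) or die "cannot open output: $!";

while ( my $line = <$in_fh> ) {
    chomp $line;
    my @values = split /\t/, $line;           # one row as an array
    push @values, $values[2] - $values[1];    # append a new column value
    print {$out_fh} join( "\t", @values ), "\n";
}
close $in_fh;
close $out_fh;

print $output;
```

With real Stream objects, the equivalent steps would be next_row on the read Stream and write_row on the write Stream, as shown in the SYNOPSIS.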

METHODS

Initializing the structure

A new Bio::ToolBox::Data::Stream object may be generated directly, or indirectly through the Bio::ToolBox::Data module.

new
        my $Stream = Bio::ToolBox::Data::Stream->new(
           in           => $filename,
        );
        my $Stream = Bio::ToolBox::Data->new(
           stream       => 1,
           in           => $filename,
        );

Options to the new function are listed below. A Stream is opened in either read mode or write mode, never both; the mode is determined by whether the in or the out option is given.

in

Provide the path of the file to open for reading. File types are recognized by the extension, and compressed files (.gz) are supported. File types supported include all those listed in Bio::ToolBox::file_helper.

out

Provide the path of the file to open for writing. No check is made for pre-existing files; if it exists it will be overwritten! A new data object is prepared, therefore column names must be provided.

noheader

Boolean option indicating that the input file does not have file headers, in which case dummy headers are provided. This is not necessary for defined file types that don't normally have file headers, such as BED, GFF, or UCSC files. Ignored for output files.

columns
        my $Stream = Bio::ToolBox::Data::Stream->new(
           out      => $filename,
           columns  => [qw(Column1 Column2 ...)],
        );

When a new file is written, provide the names of the columns as an anonymous array. If no columns are provided, then a completely empty data structure is made, and columns must be added with the add_column() method below.

gff

When writing a GFF file, provide a GFF version. When this is given, the nine standard column names and metadata are automatically provided based on the file format specification. Note that the column names are not actually written in the file, but are maintained for internal use. Acceptable versions include 1, 2, 2.5 (GTF), and 3 (GFF3).

bed

When writing a BED file, provide the number of bed columns that the file will have. When this is given, the standard column names and metadata will be automatically provided based on the standard file format specification. Note that column names are not actually written to the file, but are maintained for internal use. Acceptable values are integers from 3 to 12.

ucsc

When writing a UCSC-style file format, provide the number of columns that the file will have. When this is given, the standard column names and metadata will be automatically provided based on the file format specification. Note that column names are not actually written to the file, but are maintained for internal use. Acceptable values include 10 (refFlat without gene names), 11 (refFlat with gene names), 12 (knownGene gene prediction table), and 15 (an extended gene prediction or genePredExt table).

gz

Value to set the compression status of the output file. If overwriting an input file, the default is to maintain the compression status; otherwise the default is no compression. Pass 0 for no compression, 1 for standard gzip compression, or 2 for block gzip (bgzip) compression for tabix compatibility.

duplicate
   my $Out_Stream = $Stream->duplicate($new_filename);

For an opened-to-read Stream object, you may duplicate the object as a new opened-to-write Stream object that maintains the same columns and metadata. A new, different filename must be provided.

General Metadata

There is a variety of general metadata regarding the Data structure that is available.

The following methods may be used to access or set these metadata properties. Note that metadata is only written at the beginning of the file, and so must be set prior to iterating through the file.

feature

Returns or sets the name of the features used to collect the list of features. The actual feature types are listed in the table, so this metadata is merely descriptive.

feature_type

Returns one of three specific values describing the contents of the data table inferred by the presence of specific column names. This provides a clue as to whether the table features represent genomic regions (defined by coordinate positions) or named database features. The return values include:

coordinate: Table includes at least chromosome and start
named: Table includes name, type, and/or Primary_ID
unknown: unrecognized

program

Returns or sets the name of the program generating the list.

database

Returns or sets the name or path of the database from which the features were derived.

gff

Returns or sets the version of loaded GFF files. Supported versions include 1, 2, 2.5 (GTF), and 3.

bed

Returns or sets the BED file version. Here, the BED version is simply the number of columns.

ucsc

Returns or sets the UCSC file format version. Here, the version is simply the number of columns. Supported versions include 10 (gene prediction), 11 (refFlat, or gene prediction with gene name), 12 (knownGene table), 15 (extended gene prediction), or 16 (extended gene prediction with bin).

vcf

Returns or sets the VCF file version number. VCF support is limited.

File information

These methods provide information about the file from which the data table was loaded. This does not include parsed annotation tables.

filename
path
basename
extension

Returns the filename, full path, basename, and extension of the file. Concatenating the last three values will reconstitute the original filename.
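
The relationship between these four values can be sketched with the core File::Basename module. This is an illustration only: it is an assumption that the module splits names in a comparable way, and the suffix pattern below is a made-up stand-in for its list of recognized extensions.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

my $filename = '/path/to/file.txt.gz';

# fileparse returns the name, the directory, and the matched suffix
my ( $basename, $path, $extension ) =
    fileparse( $filename, qr/\.txt(?:\.gz)?/ );

# $path is '/path/to/', $basename is 'file', $extension is '.txt.gz'
my $rebuilt = $path . $basename . $extension;
print "$rebuilt\n";
```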

add_file_metadata
  $Stream->add_file_metadata('/path/to/file.txt');

Add filename metadata. This will automatically parse the path, basename, and recognized extension from the passed filename and set the appropriate metadata attributes.

Comments

Comments are the other commented lines from a text file (lines beginning with a #) that were not parsed as metadata.

comments

Returns a copy of the array containing commented lines.

add_comment

Appends the text string to the comment array.

delete_comment

Deletes a comment. Provide the array index of the comment to delete. If an index is not provided, ALL comments will be deleted!

vcf_headers

For VCF files, this will partially parse the VCF headers into a hash structure that can be queried or manipulated. Each header line is parsed for the primary key, being the first word after the ## prefix, e.g. INFO, FORMAT, FILTER, contig, etc. Simple entries are stored directly as the value of that key. For complex entries, such as INFO and FORMAT, a second-level hash is created with the ID extracted and used as the second-level key. The value is always the remainder of the string.

For example, the following would be a simple parsed vcf header in code representation.

  $vcf_header = {
     FORMAT => {
        GT => q(ID=GT,Number=1,Type=String,Description="Genotype"),
        AD => q(ID=AD,Number=.,Type=Integer,Description="ref,alt Allelic depths"),
     },
     fileDate => 20150715,
  };

rewrite_vcf_headers

If you have altered the vcf headers exported by the vcf_headers() method, then this method will rewrite the hash structure as new comment lines. Do this prior to writing the new file stream or else you will lose your changed VCF header metadata.
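
The two-level structure described for vcf_headers() can be reproduced with a short parser. The following is a minimal sketch in core Perl, not the module's own code; parse_vcf_headers is a hypothetical helper, and the regular expressions are assumptions that cover only the simple and ID-bearing header forms described above.

```perl
use strict;
use warnings;

# Hypothetical helper: build a hash from ## header lines
sub parse_vcf_headers {
    my @lines = @_;
    my %header;
    foreach my $line (@lines) {
        if ( $line =~ /^##(\w+)=<(.+)>\s*$/ ) {
            # complex entry: key on the ID, keep the full contents as value
            my ( $key, $contents ) = ( $1, $2 );
            my ($id) = $contents =~ /\bID=([^,]+)/;
            $header{$key}{$id} = $contents if defined $id;
        }
        elsif ( $line =~ /^##(\w+)=(.+?)\s*$/ ) {
            # simple entry: store the value directly
            $header{$1} = $2;
        }
    }
    return \%header;
}

my $vcf_header = parse_vcf_headers(
    '##fileDate=20150715',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##FORMAT=<ID=AD,Number=.,Type=Integer,Description="ref,alt Allelic depths">',
);
print $vcf_header->{FORMAT}{GT}, "\n";
```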

Column Metadata

Information about the columns may be accessed. This includes the names of the columns and shortcuts to specific identifiable columns, such as name and coordinates. In addition, each column may have additional metadata, stored as a series of key => value pairs. The minimum keys are 'index' (the 0-based index of the column) and 'name' (the column header name). Additional keys and values may be queried or set as appropriate. When the file is written, these are stored as commented metadata lines at the beginning of the file. Setting metadata is futile after reading or writing has begun.

list_columns

Returns an array or array reference of the column names in ascending (left to right) order.

number_columns

Returns the number of columns in the Data table.

last_column

Returns the array index of the last (rightmost) column in the Data table.

name
  $Stream->name($index, $new_name);
  my $name = $Stream->name($i);

Convenient method to return the name of the column given the index number. A column may also be renamed by passing a new name.

metadata
  $Stream->metadata($index, $key, $new_value);
  my $value = $Stream->metadata($index, $key);

Returns or sets the metadata value for a specific $key for a specific column $index.

This may also be used to add a new metadata key. Simply provide the name of a new $key that is not present.

If no key is provided, then a hash or hash reference is returned representing the entire metadata for that column.

copy_metadata
  $Stream->copy_metadata($source, $target);

This method will copy the metadata (everything except name and index) between the source column and target column. Returns 1 if successful.

delete_metadata
  $Stream->delete_metadata($index, $key);

Deletes a column-specific metadata $key and value for a specific column $index. If a $key is not provided, then all metadata keys for that index will be deleted.

find_column
  my $i = $Stream->find_column('Gene');
  my $i = $Stream->find_column('^Gene$');

Searches the column names for the specified column name. This employs a case-insensitive grep search, so partial names and simple regular expression patterns, such as the anchors in the second example, may be used.
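
The difference between a loose and an anchored search can be sketched as follows. This is an illustration of case-insensitive pattern matching only; find_index is a hypothetical stand-in, the column names are made up, and the module may apply additional rules when several columns match.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the column search: return the index of the
# first name matching the pattern, ignoring case, or undef if none match
sub find_index {
    my ( $pattern, @names ) = @_;
    for my $i ( 0 .. $#names ) {
        return $i if $names[$i] =~ /$pattern/i;
    }
    return;
}

my @columns = qw(Chromosome Start Stop GeneName Gene);

my $loose  = find_index( 'gene',   @columns );    # first partial match
my $exact  = find_index( '^Gene$', @columns );    # whole-name match only
my $absent = find_index( 'Score',  @columns );    # no match

print "loose=$loose exact=$exact\n";
```

Here the loose search stops at GeneName (index 3), while the anchored pattern skips it and finds Gene (index 4).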

chromo_column
start_column
stop_column
strand_column
name_column
type_column
id_column

These methods will return the identified column best matching the description. Returns undef if that column is not present. These use the "find_column" method with a predefined list of aliases.

Modifying Columns

These methods allow modification to the number and order of the columns in a Stream object. These methods can only be employed prior to opening a file handle for writing, i.e. before the first "write_row" method is called. This enables one, for example, to duplicate a read-only Stream object to create a write-only Stream, add or delete columns, and then begin the row iteration.

add_column
  my $i = $Stream->add_column($name);

Appends a new column at the rightmost position (highest index). It adds the column header name and creates a new column metadata hash. Pass a text string representing the new column name. It returns the new column index if successful.

copy_column
  my $j = $Stream->copy_column($i);

This will copy a column, appending the duplicate column at the rightmost position (highest index). It will duplicate column metadata as well. It will return the new index position.

delete_column

Deletes one or more specified columns. Any remaining columns rightwards will have their indices shifted down appropriately. If you had identified one of the shifted columns, you may need to re-find or calculate its new index.

reorder_column
  $Stream->reorder_column($c, $b, $a, $a);

Reorders columns into the specified order. Provide the new desired order of indices. Columns could be duplicated or deleted using this method. The columns will adopt their new index numbers.
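
The effect of an index list on column order is the same as that of a Perl array slice. A minimal sketch, with made-up column names and not taken from the module's code:

```perl
use strict;
use warnings;

my @columns = qw(Chromosome Start Stop Name);    # indices 0..3

# New order given as a list of old indices: Name moves first,
# Stop is dropped, and Chromosome is duplicated at the end
my @order     = ( 3, 0, 1, 0 );
my @reordered = @columns[@order];

print join( "\t", @reordered ), "\n";
```

Any index left out of the list is deleted, and any index listed twice is duplicated.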

Row Data Access

Once a file Stream object has been opened, and metadata and/or columns adjusted as necessary, then the file contents can be iterated through, one row at a time. This is typically a one-way direction. If you need to go back or start over, the easiest thing to do is re-open the file as a new Stream object.

There are two main methods, "next_row" for reading and "write_row" for writing. They cannot and should not be used on the same Stream object.

next_row
next_line
read_line

This method reads the next line in the file handle and returns a Bio::ToolBox::Data::Feature object. This object represents the values in the current file row.

Note that strand values and 0-based start coordinates are automatically converted to BioPerl conventions if required by the file type.
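
As an illustration of what this conversion involves: BED files store a 0-based start and a symbolic strand, whereas BioPerl uses a 1-based start and a numeric strand of 1, -1, or 0. The sketch below shows that arithmetic; to_bioperl is a hypothetical helper, not a method of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a 0-based start and a symbolic strand
# to a 1-based start and a numeric strand
sub to_bioperl {
    my ( $start0, $strand ) = @_;
    my %numeric = ( '+' => 1, '-' => -1, '.' => 0 );
    my $strand_value = exists $numeric{$strand} ? $numeric{$strand} : 0;
    return ( $start0 + 1, $strand_value );
}

# Example BED fields: chr1  99  200  geneA  0  -
my ( $start, $strand ) = to_bioperl( 99, '-' );
print "start=$start strand=$strand\n";
```

The end coordinate is left unchanged, because a 0-based, half-open interval and a 1-based, closed interval share the same end value.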

add_row
add_line
write_row
write_line
  $Stream->add_row(\@values);
  $Stream->add_row($Row); # Bio::ToolBox::Data::Feature object

This method writes a new row or line to a file handle. The first time this method is called the file handle is automatically opened for writing. Up to this point, columns may be manipulated. After this point, columns cannot be adjusted (otherwise the file structure becomes inconsistent).

This method may be used in one of three ways, based on the type of data that is passed.

  • A Feature object

    A Feature object representing a row from another Bio::ToolBox::Data data table or Stream. The values from this object will be automatically obtained. Modified strand and 0-based coordinates may be adjusted back as necessary.

  • An array reference of values

    Pass an array reference of values. The number of elements should match the number of expected columns. The values will be automatically joined using tabs. This implementation should be used if you are using values from another Stream and the number of columns has been modified.

    Manipulation of strand and 0-based starts may be performed if the metadata indicates this should be done.

  • A string

    Pass a text string. This assumes the column values are already tab concatenated. A new line character is appended if one is not included. No data manipulation (strand or 0-based starts) or sanity checking of the required number of columns is performed. Use with caution!
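
The three input types can be told apart with ordinary Perl type checks. The following is a minimal sketch of such a dispatch, not the module's implementation: format_row, the My::Row class, and its fields accessor are all hypothetical, and the coordinate and strand adjustments described above are omitted.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Minimal stand-in for a row object; 'fields' is a hypothetical accessor
package My::Row {
    sub new    { my ( $class, @v ) = @_; return bless [@v], $class }
    sub fields { return @{ $_[0] } }
}

# Hypothetical helper: turn any of the three inputs into one output line
sub format_row {
    my ($data) = @_;
    if ( blessed($data) && $data->can('fields') ) {
        return join( "\t", $data->fields ) . "\n";    # object
    }
    if ( ref($data) eq 'ARRAY' ) {
        return join( "\t", @{$data} ) . "\n";         # array reference
    }
    $data .= "\n" unless $data =~ /\n\z/;             # plain string
    return $data;
}

my $from_object = format_row( My::Row->new(qw(chr1 100 200)) );
my $from_array  = format_row( [qw(chr1 100 200)] );
my $from_string = format_row("chr1\t100\t200");

print $from_object, $from_array, $from_string;
```

All three calls produce the same tab-delimited line, which is why a string is the least safe input: nothing checks that it has the expected number of columns.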

iterate
    $Stream->iterate( sub {
       my $row = shift;
       my $number = $row->value($index);
       my $log_number = log($number);
       $row->value($index, $log_number);
    } );

A convenience method that will process a code reference for every line in the file. Pass a subroutine or code reference. The subroutine will receive the line as a Bio::ToolBox::Data::Feature object, just as with the "read_line" method.

File Handle methods

The below methods work with the file handle. When you are finished with a Stream, you should be kind and close the file handle properly.

mode

Returns the write mode of the Stream object. Read-only objects return false (0) and write-only Stream objects return true (1).

close_fh

Closes the file handle.

fh

Returns the IO::File-compatible object representing the file handle. Use with caution.

SEE ALSO

Bio::ToolBox::Data, Bio::ToolBox::Data::Feature

AUTHOR

 Timothy J. Parnell, PhD
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.