NAME

Bio::ToolBox::Data::Feature - Objects representing rows in a data table

DESCRIPTION

A Bio::ToolBox::Data::Feature is an object representing a row in the data table. Usually, this in turn represents an annotated feature or segment in the genome. As such, this object provides convenient methods for accessing and manipulating the values in a row, as well as methods for working with the represented genomic feature.

This class should NOT be created directly by the user. Rather, Feature objects are generated by a Bio::ToolBox::Data::Iterator object (itself generated by the row_stream function in Bio::ToolBox::Data) or by the iterate function in Bio::ToolBox::Data. Please see the respective documentation for more information.

Example of working with a stream object.

  use Bio::ToolBox::Data;

  # $file is the path to a data table; $index is the column of interest
  my $Data = Bio::ToolBox::Data->new(file => $file);

  # stream method
  my $stream = $Data->row_stream;
  while (my $row = $stream->next_row) {
     # each $row is a Bio::ToolBox::Data::Feature object
     # representing the row in the data table
     my $value = $row->value($index);
     # do something with $value
  }

  # iterate method
  $Data->iterate( sub {
     my $row = shift;
     my $number = $row->value($index);
     # skip null ('.') and non-positive values, which log() cannot take
     if ($number ne '.' and $number > 0) {
        $row->value($index, log($number));
     }
  } );

METHODS

General information methods

row_index

Returns the index position of the current data row within the data table. Useful for knowing where you are within the data table.

feature_type

Returns one of three specific values describing the contents of the data table inferred by the presence of specific column names. This provides a clue as to whether the table features represent genomic regions (defined by coordinate positions) or named database features. The return values include:

coordinate: Table includes at least chromosome and start
named: Table includes name, type, and/or Primary_ID
unknown: unrecognized
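
The inference described above can be pictured with a small stand-alone sketch. The helper name guess_feature_type is hypothetical and the column names it checks are a simplification; the module itself may recognize additional aliases.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::ToolBox: guess the table type
# from its column names (compared case-insensitively).
sub guess_feature_type {
    my %has = map { lc($_) => 1 } @_;

    # coordinate tables need at least a chromosome and a start column
    return 'coordinate' if $has{chromosome} && $has{start};

    # named tables carry a name, a type, and/or a Primary_ID column
    return 'named' if $has{name} || $has{type} || $has{primary_id};

    return 'unknown';
}

print guess_feature_type(qw(Chromosome Start Stop Score)), "\n";
print guess_feature_type(qw(Primary_ID Name Type)), "\n";
print guess_feature_type(qw(Sample Value)), "\n";
```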

column_name

Returns the column name for the given index.

data

Returns the parent Bio::ToolBox::Data object, in case you may have lost it by going out of scope.

Methods to access row feature attributes

These methods return the corresponding value, if present in the data table, based on the column header name. If the row represents a named database object, try calling the feature() method first. This will retrieve the database SeqFeature object, and the attributes can then be retrieved using the methods below or on the actual database SeqFeature object.

These methods do not set attribute values. If you need to change the values in a table, use the value() method below.

seq_id

The name of the chromosome the feature is on.

start
end
stop

The coordinates of the feature or segment. Coordinates from known 0-based file formats, e.g. BED, are returned as 1-based. Coordinates must be integers to be returned. Zero or negative start coordinates are assumed to be accidents or poor programming and transformed to 1. Use the value() method if you don't want this to happen.
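
As a rough illustration of the rules above, the following sketch normalizes a raw start value. The helper normalize_start and its $zero_based flag are hypothetical and are not the module's own code; the order in which the two adjustments are applied is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: normalize a raw start value from a table.
# $zero_based is true for formats such as BED whose start is 0-based.
sub normalize_start {
    my ($raw, $zero_based) = @_;
    return unless defined $raw && $raw =~ /^-?\d+$/;    # integers only
    my $start = $zero_based ? $raw + 1 : $raw;          # shift to 1-based
    return $start < 1 ? 1 : $start;                     # zero or negative becomes 1
}

print normalize_start(99, 1), "\n";    # BED start 99 is base 100
print normalize_start(-5, 0), "\n";    # negative start is set to 1
```

When the raw table value is wanted instead, value() returns it unchanged.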

strand

The strand of the feature or segment. Returns -1, 0, or 1. Default is 0, or unstranded.

name
display_name

The name of the feature.

coordinate

Returns a coordinate string formatted as "seqid:start-stop".
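
A minimal sketch of building such a string and splitting it again, using two hypothetical helpers (make_coordinate and split_coordinate) that are not part of the module:

```perl
use strict;
use warnings;

# Hypothetical helper: format a "seqid:start-stop" string.
sub make_coordinate {
    my ($seq_id, $start, $stop) = @_;
    return sprintf '%s:%d-%d', $seq_id, $start, $stop;
}

# Hypothetical helper: split the string back into its three parts.
sub split_coordinate {
    my $string = shift;
    # greedy match on the sequence name so names containing ':' still parse
    my ($seq_id, $start, $stop) = $string =~ /^(.+):(\d+)-(\d+)$/
        or return;
    return ($seq_id, $start, $stop);
}

my $string = make_coordinate('chr1', 100, 200);
my @parts  = split_coordinate($string);
print "$string => @parts\n";
```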

type

The type of feature. Typically either primary_tag or primary_tag:source_tag. In a GFF3 file, this represents columns 3 and 2, respectively. In annotation databases such as Bio::DB::SeqFeature::Store, the type is used to restrict to one of many different types of features, e.g. gene, mRNA, or exon.

id
primary_id

Here, this represents the primary_ID in the database. Note that this number is generally unique to a specific database, and not portable between databases.

length

The length of the feature or segment.

score

Returns the value of the Score column, if one is available. Typically associated with defined file formats, such as GFF files (6th column), BED and related Peak files (5th column), and bedGraph (4th column).

Accessing and setting values in the row

value($index)
value($index, $new_value)

Returns or sets the value at a specific column index in the current data row. Null values return a '.', symbolizing an internal null value.

row_values

Returns an array or array reference representing all the values in the current data row.

Special feature attributes

GFF and VCF files have special attributes in the form of key=value pairs. These are stored as specially formatted, character-delimited lists in certain columns. These methods parse this information and return it as a convenient hash reference. The keys and values of this hash may be changed, deleted, or added to as desired. To write the changes back to the table with the proper formatting, use the rewrite_attributes() method.

attributes

Generic method that calls either gff_attributes() or vcf_attributes() depending on the data table format.

gff_attributes

Parses the 9th column of GFF files. URL-escaped characters are converted back to text. Returns a hash reference of key => value pairs.
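
To make the parsing concrete, here is a stand-alone sketch of how a GFF3 attribute column can be turned into a hash reference. The function parse_gff_attributes is hypothetical and only approximates what gff_attributes() is described as doing; multi-valued attributes are left as a single comma-delimited string.

```perl
use strict;
use warnings;

# Hypothetical parser for a GFF3 attributes column (9th column).
sub parse_gff_attributes {
    my $column = shift;
    my %attr;
    foreach my $pair (split /;/, $column) {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $key && defined $value;
        # convert URL-escaped characters such as %3B back to text
        $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
        $attr{$key} = $value;
    }
    return \%attr;
}

my $attr = parse_gff_attributes('ID=gene01;Name=YFG1;Note=kinase%3B putative');
print "$attr->{Name}: $attr->{Note}\n";
```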

vcf_attributes

Parses the INFO (8th column) and all sample columns (10th and higher columns) in a version 4 VCF file. The Sample columns use the FORMAT column (9th column) as keys. The returned hash reference has two levels: The first level keys are both the column names and index (0-based). The second level keys are the individual attribute keys to each value. For example:

   my $attr = $row->vcf_attributes;
   # access by column name
   my $genotype = $attr->{sample1}{GT};
   my $depth = $attr->{INFO}{ADP};
   # access by 0-based column index
   $genotype = $attr->{9}{GT};
   $depth = $attr->{7}{ADP};
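
The two-level layout can be reproduced outside the module with a short sketch. The function parse_vcf_attributes below is hypothetical; storing value-less INFO flags as 1 is an assumption of this sketch, not documented behavior.

```perl
use strict;
use warnings;

# Hypothetical parser mimicking the two-level layout described above.
# $names holds the column names, $fields one row's values.
sub parse_vcf_attributes {
    my ($names, $fields) = @_;
    my %attr;

    # INFO (8th column, index 7): semicolon-delimited key=value pairs
    my %info;
    foreach my $item (split /;/, $fields->[7]) {
        my ($key, $value) = split /=/, $item, 2;
        next unless defined $key && length $key;
        $info{$key} = defined $value ? $value : 1;    # flags stored as 1
    }
    $attr{INFO} = $attr{7} = \%info;

    # sample columns (index 9 and up) use the FORMAT column (index 8) as keys
    my @keys = split /:/, $fields->[8];
    foreach my $i (9 .. $#$fields) {
        my %sample;
        @sample{@keys} = split /:/, $fields->[$i];
        $attr{ $names->[$i] } = $attr{$i} = \%sample;
    }
    return \%attr;
}

my @names  = qw(CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1);
my @fields = ('chr1', 1234, '.', 'A', 'G', 50, 'PASS', 'ADP=35;WT=0', 'GT:DP', '0/1:35');
my $attr   = parse_vcf_attributes(\@names, \@fields);
print "$attr->{sample1}{GT} $attr->{7}{ADP}\n";
```

Because the name and the index point at the same inner hash, a change made through one key is visible through the other.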

rewrite_attributes

Generic method that either calls rewrite_gff_attributes() or rewrite_vcf_attributes() depending on the data table format.

rewrite_gff_attributes

Rewrites the GFF attributes column (the 9th column) based on the contents of the attributes hash that was previously generated with the gff_attributes() method. Useful when you have modified the contents of the attributes hash.
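
For illustration, the reverse operation (hash reference back to a column string) might look like the sketch below. The function format_gff_attributes is hypothetical, and its key ordering (ID first, then alphabetical) is an arbitrary choice for this example rather than the module's behavior.

```perl
use strict;
use warnings;

# Hypothetical serializer: hash reference back to a GFF3 attribute string.
sub format_gff_attributes {
    my $attr = shift;
    my @keys = sort keys %$attr;
    @keys = ('ID', grep { $_ ne 'ID' } @keys) if exists $attr->{ID};

    my @pairs;
    foreach my $key (@keys) {
        my $value = $attr->{$key};
        # escape characters with special meaning in the 9th column;
        # commas are left alone since they separate multiple values
        $value =~ s/([;=%&\t\n\r])/sprintf('%%%02X', ord $1)/ge;
        push @pairs, "$key=$value";
    }
    return join ';', @pairs;
}

print format_gff_attributes(
    { Name => 'YFG1', ID => 'gene01', Note => 'kinase; putative' }
), "\n";
```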

rewrite_vcf_attributes

Rewrite the VCF attributes for the INFO (8th column), FORMAT (9th column), and sample columns (10th and higher columns) based on the contents of the attributes hash that was previously generated with the vcf_attributes() method. Useful when you have modified the contents of the attributes hash.

Convenience Methods to database functions

The following functions are convenience methods for using the attributes in the current data row to interact with databases. They are wrappers to methods in the Bio::ToolBox::db_helper module.

seqfeature
feature

Returns a SeqFeature object representing the feature or item in the current row. If the SeqFeature object is stored in the parent $Data object, it is retrieved from there. Otherwise, the SeqFeature object is retrieved from the database using the name and type values in the current Data table row. The SeqFeature object is requested from the database named in the general metadata. If an alternate database is desired, you should change it first using the $Data->database() method. If the feature name or type is not present in the table, then nothing is returned.

See Bio::ToolBox::SeqFeature and Bio::SeqFeatureI for more information about working with these objects.

segment

Returns a database Segment object corresponding to the coordinates defined in the Data table row. If a named feature and type are present instead of coordinates, then the feature is first automatically retrieved and a Segment returned based on its coordinates. The database named in the general metadata is used to establish the Segment object. If a different database is desired, it should be changed first using the general database() method.

See Bio::DB::SeqFeature::Segment and Bio::RangeI for more information about working with Segment objects.

get_features(%args)

Returns SeqFeature objects from a database that overlap the Feature or interval in the current Data table row. This is essentially a convenience wrapper for a Bio::DB style features method using the coordinates of the Feature. Optionally pass an array of key value pairs to specify alternate coordinates if so desired. Potential keys include:

seq_id
start
end
type: The type of database features to retrieve.
db: An alternate database object to collect from.

get_sequence(%args)

Fetches genomic sequence based on the coordinates of the current seqfeature or interval in the current Feature. This requires a database that contains the genomic sequence, either the database specified in the Data table metadata or an external indexed genomic fasta file. The sequence is returned as a simple string. If the feature is on the reverse strand, then the reverse complement sequence is automatically returned. Pass an array of key value pairs to specify alternate coordinates if so desired. Potential keys include:

seq_id
start
end
strand
extend: Indicate additional basepairs of sequence added to both sides.
db: The fasta file or database from which to fetch the sequence.
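
The strand and extend handling can be sketched on an in-memory string. The helper fetch_subsequence is hypothetical; the real method reads from a database or indexed fasta file, and clipping at the sequence ends is an assumption made here.

```perl
use strict;
use warnings;

# Hypothetical helper: cut a 1-based inclusive interval out of a sequence
# string, optionally extended on both sides, and reverse complement it
# when the strand is negative.
sub fetch_subsequence {
    my ($genome, %args) = @_;
    my $extend = $args{extend} || 0;
    my $start  = $args{start} - $extend;
    my $end    = $args{end} + $extend;
    $start = 1 if $start < 1;                            # clip to sequence bounds
    $end   = length($genome) if $end > length($genome);

    my $seq = substr $genome, $start - 1, $end - $start + 1;
    if ( ($args{strand} || 0) < 0 ) {
        $seq = reverse $seq;                 # scalar context reverses the string
        $seq =~ tr/ACGTacgt/TGCAtgca/;       # complement each base
    }
    return $seq;
}

print fetch_subsequence('AACCGGTTAC', start => 2, end => 5), "\n";
print fetch_subsequence('AACCGGTTAC', start => 2, end => 5, strand => -1), "\n";
```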

Data collection

The following methods allow for data collection from various sources, including bam, bigwig, bigbed, useq, Bio::DB databases, etc.

get_score(%args)

This method collects a single score over the feature or interval. Usually a mathematical or statistical method is employed to derive the single score. Pass an array of key value pairs to control data collection. Keys include the following:

db
ddb

Specify a Bio::DB database from which to collect the data. The default value is the database specified in the Data table metadata, if present. Examples include a Bio::DB::SeqFeature::Store or Bio::DB::BigWigSet database.

dataset

Specify the name of the dataset. If a database was specified, then this value would be the primary_tag or type:source feature found in the database. Otherwise, the name of a data file, such as a bam, bigWig, bigBed, or USeq file, would be provided here. This option is required!

method

Specify the mathematical or statistical method combining multiple scores over the interval into one value. Options include the following:

mean (default)
sum
min
max
median
count: Count all overlapping items.
pcount: Precisely count only containing (not overlapping) items.
ncount: Count overlapping unique names only.
range: Difference between the maximum and minimum values.
stddev: Standard deviation.
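
The numeric methods in this list reduce many scores to one value. A stand-alone sketch of such a reducer follows; combine_scores is a hypothetical function, and the use of the sample standard deviation (n - 1 denominator) is an assumption rather than a statement about the module.

```perl
use strict;
use warnings;
use List::Util qw(sum min max);

# Hypothetical reducer for the numeric methods listed above.
sub combine_scores {
    my ($method, @scores) = @_;
    return unless @scores;
    my $n    = @scores;
    my $mean = sum(@scores) / $n;

    return $mean                       if $method eq 'mean';
    return sum(@scores)                if $method eq 'sum';
    return min(@scores)                if $method eq 'min';
    return max(@scores)                if $method eq 'max';
    return max(@scores) - min(@scores) if $method eq 'range';
    if ($method eq 'median') {
        my @sorted = sort { $a <=> $b } @scores;
        my $mid    = int($n / 2);
        return $n % 2 ? $sorted[$mid] : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
    }
    if ($method eq 'stddev') {
        return 0 if $n < 2;
        # sample standard deviation: an assumption of this sketch
        return sqrt( sum(map { ($_ - $mean) ** 2 } @scores) / ($n - 1) );
    }
    die "unknown method '$method'\n";
}

my @scores = (2, 4, 4, 4, 5, 5, 7, 9);
printf "%s\t%s\n", $_, combine_scores($_, @scores)
    for qw(mean sum min max median range);
```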

strandedness

Specify the strand from which the data should be taken, with respect to the Feature strand. Three options are available. This is only relevant for data sources that support strand.

sense: The same strand as the Feature.
antisense: The strand opposite the Feature.
all: Strand is ignored and all data is taken (default).

exon

Boolean option to indicate that the data collection should only occur over exonic subfeatures, and not over introns. Requires that the Data table Feature be a named SeqFeature gene that contains exon subfeatures, for example parsed from a gene table file.

extend

Specify the number of basepairs that the Data table Feature's coordinates should be extended in both directions.

seq_id
chromo
start
end
stop
strand

Optionally specify zero or more alternate coordinates to use. By default, these are obtained from the Data table Feature.

Example:

  while (my $row = $stream->next_row) {
     my $score = $row->get_score(
        'method'    => 'mean',
        'dataset'   => '/path/to/MyData.bw',
        'exon'      => 1,
     );
  }

get_relative_point_position_scores(%args)

This method collects indexed position scores centered around a specific reference point. The returned data is a hash of relative positions (example -20, -10, 1, 10, 20) and their score values. Pass an array of key value pairs to control data collection. Keys include the following:

db
ddb

Specify a Bio::DB database from which to collect the data. The default value is the database specified in the Data table metadata, if present. Examples include a Bio::DB::SeqFeature::Store or Bio::DB::BigWigSet database.

dataset

Specify the name of the dataset. If a database was specified, then this value would be the primary_tag or type:source feature found in the database. Otherwise, the name of a data file, such as a bam, bigWig, bigBed, or USeq file, would be provided here. This option is required!

position

Indicate the position of the reference point relative to the Data table Feature. 5 is the 5' coordinate, 3 is the 3' coordinate, and 4 is the midpoint (get it? it's between 5 and 3). Default is 5.
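
How a position code maps to a chromosomal coordinate can be sketched as follows. The helper reference_point is hypothetical; treating strand 0 as forward and rounding the midpoint down are assumptions of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: resolve the position code (5, 4, or 3) to a
# chromosomal coordinate for a feature with the given start, end, strand.
sub reference_point {
    my ($start, $end, $strand, $position) = @_;
    $position = 5 unless defined $position;              # documented default
    return int( ($start + $end) / 2 ) if $position == 4; # midpoint
    my $reverse = $strand < 0;
    if ($position == 5) { return $reverse ? $end : $start }
    if ($position == 3) { return $reverse ? $start : $end }
    die "position must be 5, 4, or 3\n";
}

print reference_point(1000, 2000,  1, 5), "\n";   # 5' end, forward strand
print reference_point(1000, 2000, -1, 5), "\n";   # 5' end, reverse strand
print reference_point(1000, 2000,  1, 4), "\n";   # midpoint
```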

extend

Indicate the number of base pairs to extend from the reference coordinate. This option is required!

coordinate

Optionally provide the real chromosomal coordinate as the reference point.

absolute

Boolean option to indicate that the returned hash of positions and scores should not be transformed into relative positions but kept as absolute chromosomal coordinates.

avoid

Provide a primary_tag:source database feature type to avoid. Each collected score is checked against database features of this type and is discarded if it overlaps one. A database must be set to use this option.

strandedness

Specify the strand from which the data should be taken, with respect to the Feature strand. Three options are available. This is only relevant for data sources that support strand.

sense: The same strand as the Feature.
antisense: The strand opposite the Feature.
all: Strand is ignored and all data is taken (default).

method

Only required when counting objects.

count: Count all overlapping items.
pcount: Precisely count only containing (not overlapping) items.
ncount: Count overlapping unique names only.

Example:

  while (my $row = $stream->next_row) {
     my $pos2score = $row->get_relative_point_position_scores(
        'ddb'       => '/path/to/BigWigSet/',
        'dataset'   => 'MyData',
        'position'  => 5,
        'extend'    => 1000,
     );
  }

get_region_position_scores(%args)

This method collects indexed position scores across a defined region or interval. The returned data is a hash of positions and their score values. The positions are by default relative to a region coordinate, usually to the 5' end. Pass an array of key value pairs to control data collection. Keys include the following:

db
ddb

Specify a Bio::DB database from which to collect the data. The default value is the database specified in the Data table metadata, if present. Examples include a Bio::DB::SeqFeature::Store or Bio::DB::BigWigSet database.

dataset

Specify the name of the dataset. If a database was specified, then this value would be the primary_tag or type:source feature found in the database. Otherwise, the name of a data file, such as a bam, bigWig, bigBed, or USeq file, would be provided here. This option is required!

exon

Boolean option to indicate that the data collection should only occur over exonic subfeatures, and not over introns. Requires that the Data table Feature be a named SeqFeature gene that contains exon subfeatures, for example parsed from a gene table file.

When converting to relative coordinates, the coordinates will be relative to the length of the sum of the exons, i.e. the length of the introns will be ignored.
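
The idea of ignoring intron length can be illustrated with a short sketch. The helper exon_relative_position is hypothetical, handles only forward-strand features, and numbers the first exonic base as 1, which is an assumption of this example.

```perl
use strict;
use warnings;

# Hypothetical helper: map a genomic position to its position relative to
# the summed exons. Exons are [start, end] pairs, 1-based inclusive,
# sorted and non-overlapping. Returns nothing for intronic positions.
sub exon_relative_position {
    my ($pos, @exons) = @_;
    my $offset = 0;    # total exonic length seen so far
    foreach my $exon (@exons) {
        my ($start, $end) = @$exon;
        return $offset + $pos - $start + 1 if $pos >= $start && $pos <= $end;
        $offset += $end - $start + 1;
    }
    return;
}

my @exons = ( [101, 200], [501, 600] );
print exon_relative_position(150, @exons), "\n";   # inside the first exon
print exon_relative_position(501, @exons), "\n";   # first base of the second exon
```

In this example the 300 bp intron between the two exons contributes nothing, so the first base of the second exon directly follows the last base of the first.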

extend

Specify the number of basepairs that the Data table Feature's coordinates should be extended in both directions.

seq_id
chromo
start
end
stop
strand

Optionally specify zero or more alternate coordinates to use. By default, these are obtained from the Data table Feature.

position

Indicate the position of the reference point relative to the Data table Feature. 5 is the 5' coordinate, 3 is the 3' coordinate, and 4 is the midpoint (get it? it's between 5 and 3). Default is 5.

coordinate

Optionally provide the real chromosomal coordinate as the reference point.

absolute

Boolean option to indicate that the returned hash of positions and scores should not be transformed into relative positions but kept as absolute chromosomal coordinates.

avoid

Provide a primary_tag:source database feature type to avoid. Each collected score is checked against database features of this type and is discarded if it overlaps one. A database must be set to use this option.

strandedness

Specify the strand from which the data should be taken, with respect to the Feature strand. Three options are available. This is only relevant for data sources that support strand.

sense: The same strand as the Feature.
antisense: The strand opposite the Feature.
all: Strand is ignored and all data is taken (default).

method

Only required when counting objects.

count: Count all overlapping items.
pcount: Precisely count only containing (not overlapping) items.
ncount: Count overlapping unique names only.

Example:

  while (my $row = $stream->next_row) {
     my $pos2score = $row->get_region_position_scores(
        'ddb'       => '/path/to/BigWigSet/',
        'dataset'   => 'MyData',
        'position'  => 5,
        'extend'    => 1000,
     );
  }

Feature Export

These methods allow the feature to be exported in industry standard formats, including the BED format and the GFF format. Both methods return a formatted tab-delimited text string suitable for printing to file. The string does not include a line ending character.

These methods rely on coordinates being present in the source table. If the row feature represents a database item, the feature() method should be called prior to these methods, allowing the feature to be retrieved from the database and coordinates obtained.

bed_string(%args)

Returns a BED formatted string. By default, a 6-element string is generated, unless otherwise specified. Pass an array of key values to control how the string is generated. The following arguments are supported.

bed => <integer>

Specify the number of BED elements to include. The number of elements corresponds to the number of columns in the BED file specification. A minimum of 3 (chromosome, start, stop) is required, and a maximum of 6 is allowed (chromosome, start, stop, name, score, strand).

chromo => <text>
seq_id => <text>
start => <integer>
stop => <integer>
end => <integer>
strand => $strand

Provide alternate values from those defined or missing in the current row Feature. Note that start values are automatically converted to 0-based by subtracting 1.

name => <text>

Provide alternate or missing name value to be used as text in the 4th column. If no name is provided or available, a default name is generated.

score => <number>

Provide a numerical value to be included as the score. BED files typically use integer values ranging from 0 to 1000.
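
The column layout and the start conversion can be seen in a small stand-alone sketch. The function make_bed_string is hypothetical and simplified; in particular it uses "." for a missing name, whereas the method is documented to generate a default name.

```perl
use strict;
use warnings;

# Hypothetical formatter: build a BED line with 3 to 6 columns from
# 1-based coordinates.
sub make_bed_string {
    my %args = @_;
    my $bed  = $args{bed} || 6;
    die "bed must be between 3 and 6\n" if $bed < 3 || $bed > 6;

    my $strand  = $args{strand} || 0;
    my @columns = (
        $args{seq_id},
        $args{start} - 1,                            # BED starts are 0-based
        $args{end},                                  # BED ends are unchanged
        defined $args{name}  ? $args{name}  : '.',   # placeholder name
        defined $args{score} ? $args{score} : 0,
        $strand > 0 ? '+' : $strand < 0 ? '-' : '.',
    );
    return join "\t", @columns[ 0 .. $bed - 1 ];     # no line ending
}

print make_bed_string(
    seq_id => 'chr1', start => 101, end => 200,
    name   => 'peak1', score => 500, strand => -1,
), "\n";
```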

gff_string(%args)

Returns a GFF3 formatted string. Pass an array of key values to control how the string is generated. The following arguments are supported.

chromo => <text>
seq_id => <text>
start => <integer>
stop => <integer>
end => <integer>
strand => $strand

Provide alternate values from those defined or missing in the current row Feature.

source => <text>

Provide a text string to be used as the source_tag value in the 2nd column. The default value is null ".".

primary_tag => <text>

Provide a text string to be used as the primary_tag value in the 3rd column. The default value is null ".".

type => <text>

Provide a text string. This can be either a "primary_tag:source_tag" value as used by GFF based BioPerl databases, or "primary_tag" alone.

score => <number>

Provide a numerical value to be included as the score. The default value is null ".".

name => <text>

Provide alternate or missing name value to be used as the display_name. If no name is provided or available, a default name is generated.

attributes => [index]

Provide an anonymous array reference of one or more row Feature indices to be used as GFF attributes. The name of the column is used as the GFF attribute key.
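
For orientation, the nine GFF3 columns can be assembled as in the sketch below. The function make_gff_string is hypothetical and much simpler than the method: it accepts only a type string, writes a single Name attribute, and does not generate default names.

```perl
use strict;
use warnings;

# Hypothetical formatter: nine tab-delimited GFF3 columns.
sub make_gff_string {
    my %args = @_;

    # type may be "primary_tag:source_tag" or "primary_tag" alone
    my ($primary, $source) = ('.', '.');
    ($primary, $source) = split /:/, $args{type}, 2 if defined $args{type};
    $source = '.' unless defined $source;

    my $strand = $args{strand} || 0;
    return join "\t",
        $args{seq_id},
        $source,                                      # 2nd column
        $primary,                                     # 3rd column
        $args{start},
        $args{end},
        defined $args{score} ? $args{score} : '.',
        $strand > 0 ? '+' : $strand < 0 ? '-' : '.',
        '.',                                          # phase
        "Name=$args{name}";
}

print make_gff_string(
    seq_id => 'chr1', start => 101, end => 200,
    type   => 'gene:SGD', name => 'YFG1', strand => 1,
), "\n";
```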

AUTHOR

 Timothy J. Parnell, PhD
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.