The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Bio::ToolBox::GeneTools - SeqFeature agnostic methods for working with gene models

SYNOPSIS

    use Bio::ToolBox::GeneTools qw(:all);
    
    my $gene; # a SeqFeatureI compliant gene object obtained elsewhere
              # for example, from Bio::DB::SeqFeature::Store database
              # or parsed from a GFF3, GTF, or UCSC-style gene table using 
              # Bio::ToolBox parsers
    
    if (is_coding($gene)) { # boolean test
        
        # collect all exons from all transcripts in gene
        my @exons = get_exons($gene);
        
        # find just the alternate exons used only once
        my @alternate_exons = get_alt_exons($gene);
        
        # collect UTRs, which may not be defined in the original source
        my @utrs;
        foreach my $t (get_transcripts($gene)) {
                my @u = get_utrs($t);
                push @utrs, @u;
        }
    }

DESCRIPTION

This module provides numerous exportable functions for working with gene SeqFeature models. This assumes that the gene models follow the BioPerl Bio::SeqFeatureI convention with nested SeqFeature objects representing the gene, transcript, and exons. For example,

    gene
      transcript
        exon
        CDS

Depending upon how the SeqFeatures were generated or defined, subfeatures may or may not be defined or be obvious. For example, UTRs or introns may not be present. Furthermore, the primary_tag or type may not follow Sequence Ontology terms. Regular expressions are deployed to handle varying naming schemes and exceptions.

These functions should work with most or all Bio::SeqFeatureI compliant objects. It has been tested with Bio::ToolBox::SeqFeature, Bio::SeqFeature::Lite, and Bio::DB::SeqFeature classes.

New SeqFeature objects that are generated use the same class for simplicity and expediency.

METHODS

Function Import

None of the functions are exported by default. Specify which ones you want when you import the module. Alternatively, use one of the tags below.

:all

Import all of the methods.

:exon

Import all of the exon methods, including "get_exons", "get_alt_exons", "get_common_exons", "get_uncommon_exons", and "get_alt_common_exons".

:intron

Import all of the intron methods, including "get_introns", "get_alt_introns", "get_common_introns", "get_uncommon_introns", and "get_alt_common_introns".

:transcript

Import the transcript related methods, including "get_transcripts", "get_transcript_length", and "collapse_transcripts".

:cds

Import the CDS pertaining methods, including "is_coding", "get_cds", "get_cdsStart", "get_cdsEnd", "get_transcript_cds_length", and "get_utrs".

:export

Import all of the export methods, including "gff_string", "gtf_string", "ucsc_string", and "bed_string";

:filter

Import all of the transcript filter methods, including "filter_transcript_biotype", "filter_transcript_gencode_basic", and "filter_transcript_support_level".

Exon Methods

Functions to get a list of exons from a gene or transcript

get_exons
        my @exons = get_exons($gene);
        my @exons = get_exons($transcript);

This will return an array or array reference of all the exon subfeatures in the SeqFeature object, either gene or transcript. No discrimination whether they are used once or more than once. Non-defined exons can be assembled from CDS and/or UTR subfeatures. Exons are sorted by start coordinate.

get_alt_exons
        my @alternate_exons = get_alt_exons($gene);

This will return an array or array reference of all the exon subfeatures for a multi-transcript gene that are used only once in all of the transcripts.

get_common_exons
        my @common_exons = get_common_exons($gene);

This will return an array or array reference of all the exon subfeatures for a multi-transcript gene that are used in all of the transcripts.

get_uncommon_exons
        my @uncommon_exons = get_uncommon_exons($gene);

This will return an array or array reference of all the exon subfeatures for a multi-transcript gene that are used in some of the transcripts, i.e. more than one but not all.

get_alt_common_exons
        my %exon_hash = get_alt_common_exons($gene);

This will return a hash reference with several keys, including "common", "uncommon", and each of the transcript IDs. Each key value is an array reference with the exons for that category. The "common" will be all common exons, "uncommon" will be uncommon exons (used more than once but less than all), and each transcript ID will include their specific alternate exons (used only once).

For genes with only a single transcript, all exons will be marked as "common" for simplicity, although technically they could all be considered "alternate" since they're only used once.

Intron Methods

Functions to get a list of introns from a gene or transcript. Introns are not usually defined in gene annotation files, but are inferred from the exons and total gene or transcript length. In this case, new SeqFeature elements are generated for each intron.

get_introns
        my @introns = get_introns($gene);
        my @introns = get_introns($transcript);

This will return an array or array reference of all the intron subfeatures in the SeqFeature object, either gene or transcript. No discrimination whether they are used once or more than once. Non-defined introns can be assembled from CDS and/or UTR subfeatures. Introns are sorted by start coordinate.

get_alt_introns
        my @alternate_introns = get_alt_introns($gene);

This will return an array or array reference of all the intron subfeatures for a multi-transcript gene that are used only once in all of the transcripts.

get_common_introns
        my @common_introns = get_common_introns($gene);

This will return an array or array reference of all the intron subfeatures for a multi-transcript gene that are used in all of the transcripts.

get_uncommon_introns
        my @uncommon_introns = get_uncommon_introns($gene);

This will return an array or array reference of all the intron subfeatures for a multi-transcript gene that are used in some of the transcripts, i.e. more than one but not all.

get_alt_common_introns
        my %intron_hash = get_alt_common_introns($gene);

This will return a hash reference with several keys, including "common", "uncommon", and each of the transcript IDs. Each key value is an array reference with the introns for that category. The "common" will be all common introns, "uncommon" will be uncommon introns (used more than once but less than all), and each transcript ID will include their specific alternate introns (used only once).

For genes with only a single transcript, all introns will be marked as "common" for simplicity, although technically they could all be considered "alternate" since they're only used once.

Transcript Methods

These methods work on transcripts, typically alternate transcripts from a gene SeqFeature.

get_transcripts
        my @transcripts = get_transcripts($gene);

Returns an array or array reference of the transcripts associated with a gene feature.

collapse_transcripts
        my $new_transcript = collapse_transcripts($gene);
        my $new_transcript = collapse_transcripts($transcript1, $transcript2);

This method will collapse all of the transcripts associated with a gene SeqFeature into a single artificial transcript, merging exons as necessary to maximize exon length and minimize introns. This is useful when performing, for example, RNASeq analysis on genes. A single SeqFeature transcript object is returned containing the merged exon subfeatures.

Pass either a gene or a list of transcripts to collapse.

get_transcript_length
        my $length = get_transcript_length($transcript);

Calculates and returns the transcribed length of a transcript, i.e the sum of its exon lengths. Warning! If you pass a gene object, you will get the maximum of all transcript exon lengths, which may not be what you anticipate!

CDS methods

These methods calculate values related to the coding sequence of the mRNA transcripts.

is_coding
        if( is_coding($transcript) ) {
                # do something
        }

This method will return a boolean value (1 or 0) if the passed transcript object appears to be a coding transcript. GFF and GTF files are not always immediately clear about the type of transcript; there are (unfortunately) multiple ways to encode the feature as a protein coding transcript: primary_tag, source_tag, GFF attribute, presence of CDS subfeatures, etc. This method checks all of these possibilities. Note: If you pass a multi-transcript gene, only one transcript need to be coding to pass a true value.

get_cds
        my @cds = get_cds($transcript);

Returns the CDS subfeatures of the given transcript, if they are defined. Returns either an array or array reference.

get_cdsStart
        my $start = get_cdsStart($trancript);

Returns the start coordinate of the CDS for the given transcript. Note that this is the leftmost (smallest) coordinate of the CDS and not necessarily the coordinate of the start codon, similar to what the UCSC gene tables report. Use the transcript strand to determine the 5' end.

get_cdsEnd
        my $end = get_cdsEnd($trancript);

Returns the stop coordinate of the CDS for the given transcript. Note that this is the rightmost (largest) coordinate of the CDS and not necessarily the coordinate of the stop codon, similar to what the UCSC gene tables report. Use the transcript strand to determine which is the 3' and 5' end.

get_start_codon
        my $start_codon = get_start_codon($trancript);

Returns a SeqFeature object representing the start codon. If one is not explicitly defined in the hierarchy, then a new object is generated.

get_stop_codon
        my $stop_codon = get_stop_codon($transcript);

Returns a SeqFeature object representing the stop codon. If one is not defined in the hierarchy, then a new object is created. Note that this assumes that the stop codon is inclusive to the defined CDS, which is the case with GFF3 and UCSC gene table derived features. On the other hand, features derived from GTF is defined with the stop codon exclusive to the CDS. This shouldn't matter with GTF, however, since GTF usually explicitly includes stop codon features, whereas the other two formats do not.

get_transcript_cds_length
        my $length = get_transcript_cds_length($transcript);

Calculates and returns the length of the coding sequence for a transcript, i.e. the sum of the CDS lengths.

get_utrs
        my @utrs = get_utrs($trancript);

Returns both 5' and 3' untranslated regions of the transcript. If these are not defined in the SeqFeature subfeature hierarchy, then the coordinates will be determined from from the exon and CDS subfeatures, if available, and new SeqFeature objects generated. Non-coding transcripts will not return anything.

get_5p_utrs
        my @5p_utrs = get_5p_utrs($trancript);

Returns only the 5' untranslated regions of the transcript.

get_3p_utrs($transcript)
        my @3p_utrs = get_3p_utrs($trancript);

Returns only the 3' untranslated regions of the transcript.

Export methods

These methods are used for exporting a gene and/or transcript model into a text string based on the specified format.

gff_string
        my $string .= gff_string($gene, 1);
        my $string .= gff_string($transcript, 1);

This is just a convenience method. SeqFeature objects based on Bio::SeqFeature::Lite, Bio::DB::SeqFeature, or Bio::ToolBox::SeqFeature have a gff_string method, and this will simply call that method. SeqFeature objects that do not have this method will, of course, cause the script to terminate.

Pass the seqfeature object and a boolean value to recursively append all subfeatures (e.g. exons) to the string. In most cases, this will generate a GFF3-style string.

Bio::ToolBox::Data::Feature also provides a simplified gff_string method.

gtf_string
        my $string .= gtf_string($gene);
        my $string .= gtf_string($transcript);

This will export a gene or transcript model as a series of GTF formatted text lines, following the defined Gene Transfer Format (also known as GFF version 2.5). It will ensure that each feature is properly tagged with the gene_id and transcript_id attributes.

This method will automatically recurse through all subfeatures.

ucsc_string
        my $string = ucsc_string($gene);

This will export a gene or transcript model as a refFlat formatted Gene Prediction line (11 columns). See http://genome.ucsc.edu/FAQ/FAQformat.html#format9 for details. Multiple transcript genes are exported as multiple text lines concatenated together.

bed_string
        my $string = bed_string($gene);

This will export a gene or transcript model as a UCSC Bed formatted transcript line (12 columns). See http://genome.ucsc.edu/FAQ/FAQformat.html#format1 for details. Multiple transcript genes are exported as multiple text lines concatenated together. Note that gene information is not preserved with Bed12 files; only the transcript name is used. The RGB value is set to 0.

Filter methods

These methods are used to filter genes.

filter_transcript_support_level
        my $new_gene = filter_transcript_support_level($gene, 'best2');
        my @good_transcripts = filter_transcript_support_level(\@transcripts);

This will filter a gene object for transcripts that match or exceed the provided transcript support level. This assumes that the transcripts contain the attribute tag 'transcript_support_level', which are present in Ensembl provided GFF3 and GTF annotation files. The values are a digit (1-5), or 'NA', where 1 is experimentally supported and 5 is entirely predicted with no experimental evidence. See Ensembl TSL glossary entry for details.

Pass a gene SeqFeature object with one or more transcript subfeatures. Alternatively, an array reference of transcripts could be passed as well.

A level may be provided as a second argument. The default is 'best'.

best

Only the transcripts with the highest existing value will be retained.

best<digit>

All transcripts up to the indicated level are retained. For example, 'best3' would indicate that transcripts with support levels 1, 2, and 3 would be retained.

<digit>

Only transcripts at the given level are retained.

NA

Only transcripts with 'NA' as the value are retained. These are typically pseudogenes or single-exon transcripts.

If none of the transcripts have the attribute, then all are returned (nothing is filtered).

If a gene object was provided, a new gene object will be returned with only the retained transcripts as subfeatures. If an array reference of transcripts was provided, then an array reference of the filtered transcripts is returned.

filter_transcript_gencode_basic
        my $new_gene = filter_transcript_gencode_basic($gene);
        my @good_transcripts = filter_transcript_gencode_basic(\@transcripts);

This will filter a gene object for transcripts for the Ensembl GENCODE tag "basic", which indicates that a transcript is tagged as GENCODE Basic transcript.

If a gene object was provided, a new gene object will be returned with only the retained transcripts as subfeatures. If an array reference of transcripts was provided, then an array reference of the filtered transcripts is returned.

filter_transcript_biotype
        my $new_gene = filter_transcript_gencode_basic($gene, $biotype);
        my @good_transcripts = filter_transcript_gencode_basic(\@transcripts, 'miRNA');

This will filter a gene object for transcripts for specific biotype values using the transcript_biotype or biotype attribute tags, commonly found in Ensembl annotation.

If a gene object was provided, a new gene object will be returned with only the retained transcripts as subfeatures. If an array reference of transcripts was provided, then an array reference of the filtered transcripts is returned.

SEE ALSO

Bio::ToolBox::SeqFeature, Bio::ToolBox::parser::ucsc, Bio::ToolBox::parser::gff, Bio::ToolBox::parser::bed, Bio::Tools::GFF, Bio::SeqFeature::Lite, Bio::DB::SeqFeature, Bio::SeqFeatureI

AUTHOR

 Timothy J. Parnell, PhD
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.