NAME

Bio::ToolBox::parser::gff - parse GFF3, GTF, and GFF files

DESCRIPTION

This module parses a GFF file into SeqFeature objects. It natively handles GFF3, GTF, and general GFF files.

For both GFF3 and GTF files, fully nested gene models, typically gene => transcript => (exon, CDS, etc), may be built using the appropriate attribute tags. For GFF3 files, these include ID and Parent tags; for GTF these include gene_id and transcript_id tags.
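The two attribute conventions can be illustrated with a minimal sketch. The helper names parse_gff3_attributes and parse_gtf_attributes are hypothetical and are not part of this module; the code only shows how the ID/Parent and gene_id/transcript_id tags might be pulled from the ninth column of each format.

```perl
use strict;
use warnings;

# Hypothetical helper (not this module's API): split a GFF3 attribute
# column of the form "key=value;key=value".
sub parse_gff3_attributes {
    my ($column) = @_;
    my %attr;
    foreach my $pair ( split /;/, $column ) {
        my ( $key, $value ) = split /=/, $pair, 2;
        next unless defined $key and defined $value;
        # a value may be a comma-delimited list, e.g. several Parent IDs
        $attr{$key} = [ split /,/, $value ];
    }
    return \%attr;
}

# Hypothetical helper (not this module's API): split a GTF attribute
# column of the form 'key "value"; key "value";'.
sub parse_gtf_attributes {
    my ($column) = @_;
    my %attr;
    while ( $column =~ /(\w+)\s+"([^"]*)"\s*;?/g ) {
        push @{ $attr{$1} }, $2;
    }
    return \%attr;
}

my $gff3 = parse_gff3_attributes('ID=exon1;Parent=mRNA1,mRNA2');
my $gtf  = parse_gtf_attributes('gene_id "G1"; transcript_id "T1";');

print "GFF3 parents: @{ $gff3->{Parent} }\n";
print "GTF transcript: $gtf->{transcript_id}[0]\n";
```

In GFF3 a child names its parent explicitly, while in GTF the hierarchy has to be inferred from features sharing the same gene_id and transcript_id values.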

For GFF3 files, any feature without a Parent tag is assumed to be a parent. Child features that reference a parent feature that has not yet been loaded are considered orphans. After the file is completely parsed, the parser attempts to re-associate orphans with their missing parents. Any orphans that remain may be collected. Files with orphans are considered poorly formatted or incomplete and should be fixed. Multiple parentage, for example exons shared between different transcripts of the same gene, is fully supported.
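The orphan handling described above can be sketched with plain Perl data structures. This is only an illustration of the general two-pass idea, assuming toy records of the form [ID, Parent IDs]; it is not the module's actual implementation, and all variable names are hypothetical.

```perl
use strict;
use warnings;

# Toy records standing in for parsed GFF3 lines: [ ID, [Parent IDs] ].
# exon1 appears before its parents; exon2 names a parent that never appears.
my @records = (
    [ 'exon1', [ 'mRNA1', 'mRNA2' ] ],
    [ 'gene1', [] ],
    [ 'mRNA1', ['gene1'] ],
    [ 'mRNA2', ['gene1'] ],
    [ 'exon2', ['mRNA9'] ],
);

my ( %loaded, @top, @pending );
foreach my $record (@records) {
    my ( $id, $parents ) = @$record;
    my $feature = { id => $id, children => [] };
    $loaded{$id} = $feature;

    # no Parent tag: treat as a top-level parent
    if ( not @$parents ) {
        push @top, $feature;
        next;
    }

    # attach to each parent already seen; defer the rest
    foreach my $parent_id (@$parents) {
        if ( exists $loaded{$parent_id} ) {
            push @{ $loaded{$parent_id}{children} }, $feature;
        }
        else {
            push @pending, [ $feature, $parent_id ];
        }
    }
}

# Second pass after the whole "file" is read: try to rescue deferred children.
my @orphans;
foreach my $item (@pending) {
    my ( $feature, $parent_id ) = @$item;
    if ( exists $loaded{$parent_id} ) {
        push @{ $loaded{$parent_id}{children} }, $feature;
    }
    else {
        push @orphans, $feature;
    }
}

printf "top: %s\n",     join ',', map { $_->{id} } @top;
printf "orphans: %s\n", join ',', map { $_->{id} } @orphans;
```

Here exon1 ends up as a child of both mRNA1 and mRNA2 (multiple parentage), while exon2 remains an orphan because mRNA9 is never defined.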

Embedded Fasta sequences are ignored, as are most comment and pragma lines.

The SeqFeature objects that are returned are Bio::SeqFeature::Lite objects. Refer to that documentation for more information.

SYNOPSIS

  use Bio::ToolBox::parser::gff;
  my $filename = 'file.gff3';
  
  my $parser = Bio::ToolBox::parser::gff->new($filename) or 
        die "unable to open gff file!\n";
  
  while (my $feature = $parser->next_top_feature() ) {
        # each $feature is a Bio::SeqFeature::Lite object
        my @children = $feature->get_SeqFeatures();
  }

METHODS

Initializing the parser.

new()

Initialize a new GFF parser object. Pass a single value (a GFF file name) to open a file. Alternatively, pass an array of key => value pairs to control how the file is parsed. Options include the following.

file

Provide a GFF file name to be parsed. It should have a gff, gtf, or gff3 file extension. The file may be gzip compressed.

version

Specify the version. Normally this is not needed, as version can be determined either from the file extension (in the case of gtf and gff3) or from the ##gff-version pragma at the top of the file. Acceptable values include 1, 2, 2.5 (gtf), or 3.

skip

Pass an anonymous array of primary_tag values to be skipped from the GFF file when parsing into SeqFeature objects. For example, some subfeatures can be skipped for expediency when they are known in advance not to be needed. See skip() below.
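As an illustration of the version detection described under the version option, the following sketch guesses a version from a file name and an optional first line. The helper guess_gff_version is hypothetical, and checking the pragma before the extension is an assumption; the module may resolve the two sources in a different order.

```perl
use strict;
use warnings;

# Hypothetical helper (not this module's API). Assumes the pragma, when
# present, takes precedence over the file extension.
sub guess_gff_version {
    my ( $filename, $first_line ) = @_;

    # pragma of the form "##gff-version 3"
    if ( defined $first_line and $first_line =~ /^##gff-version\s+([\d.]+)/ ) {
        return $1;
    }

    # fall back to the extension, allowing an optional .gz suffix
    return '3'   if $filename =~ /\.gff3(?:\.gz)?$/i;
    return '2.5' if $filename =~ /\.gtf(?:\.gz)?$/i;

    # a plain .gff extension does not identify the version
    return;
}

print guess_gff_version('genes.gtf.gz'), "\n";
print guess_gff_version( 'genes.gff', '##gff-version 2' ), "\n";
```

When neither source is informative, the sketch returns nothing, which corresponds to the case where the version option would need to be given explicitly.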

open_file($file)

Pass the name of a GFF file to be parsed. The file may optionally be gzipped (.gz extension). Do not open a new file when one has already been opened. Instead, create a new object for the new file, or concatenate the GFF files.

fh()
fh($filehandle)

This method returns the IO::File object of the opened GFF file. A new file may be parsed by passing an opened IO::File or other object that inherits IO::Handle methods.

version

Set or get the GFF version of the current file. Acceptable values include 1, 2, 2.5 (gtf), or 3.

skip(@types)

Pass an array of primary_tag values that should be skipped during parsing. This can simplify and speed up parsing if certain types of subfeatures are known in advance not to be needed. Only exact matches are allowed. It is best to call this method before the file is parsed. This method also returns a list of the primary_tag values to be skipped. Examples include

  • CDS

  • five_prime_UTR

  • three_prime_UTR

  • start_codon

  • stop_codon
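The effect of exact matching can be sketched with a simple hash lookup on the third GFF column. This is an illustration only, using made-up GFF lines; it is not the module's implementation.

```perl
use strict;
use warnings;

# primary_tag values to skip, stored as hash keys for exact lookup
my %skip = map { $_ => 1 } qw(CDS exon);

# made-up GFF3 lines for illustration
my @lines = (
    "chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=mRNA1",
    "chr1\tsrc\texon\t100\t300\t.\t+\t.\tParent=mRNA1",
    "chr1\tsrc\tCDS\t150\t300\t.\t+\t0\tParent=mRNA1",
    "chr1\tsrc\tnoncoding_exon\t400\t500\t.\t+\t.\tParent=mRNA1",
);

my @kept;
foreach my $line (@lines) {
    my $type = ( split /\t/, $line )[2];
    next if $skip{$type};    # whole-string match; partial matches are kept
    push @kept, $type;
}

print "kept: @kept\n";
```

Because the lookup compares whole strings, skipping exon does not also skip noncoding_exon.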

Feature retrieval

The following methods parse the GFF file lines into SeqFeature objects. It is best if methods are not mixed; unexpected results may occur.

next_feature()

This method will return a Bio::SeqFeature::Lite object representing the next feature in the file. Parent-child relationships are NOT assembled. This is best used with simple GFF files with no hierarchies present. It may be called in a while loop until the end of the file is reached. Pragmas, comment lines, and embedded sequence are skipped automatically.

next_top_feature()

This method will return a top level parent Bio::SeqFeature::Lite object assembled with child features as sub-features. For example, a gene object with mRNA subfeatures, which in turn may have exon and/or CDS subfeatures. Child features are assembled based on the existence of proper Parent attributes in child features. If no Parent attributes are included in the GFF file, then this will behave as next_feature().

Child features (those containing a Parent attribute) are associated with the parent feature. A warning will be issued about lost children (orphans). Shared subfeatures, for example exons common to multiple transcripts, are associated properly with each parent. An opportunity to rescue orphans is available using the orphans() method.

Note that subfeatures may not necessarily be in ascending genomic order when associated with the feature, depending on their order in the GFF3 file and whether shared subfeatures are present or not. When calling subfeatures in your program, you may want to sort the subfeatures. For example

  my @subfeatures = map { $_->[0] }
                    sort { $a->[1] <=> $b->[1] }
                    map { [$_, $_->start] }
                    $parent->get_SeqFeatures;

top_features()

This method will return an array of the top (parent) features defined in the GFF file. This is similar to the next_top_feature() method except that all features are returned at once.

Other methods

Additional methods for working with the parser object and the parsed SeqFeature objects.

parse_file

Parses the file into memory.

find_gene

Pass a gene name, or an array of key => value pairs (name, display_name, ID, primary_ID, and/or coordinate information), that can be used to find a gene already loaded into memory. This works reliably only if the entire file has been loaded into memory. Genes with a matching name are confirmed by a matching ID or overlapping coordinates, if available. Otherwise the first match is returned.
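The confirmation logic described above might look something like the following sketch over plain hashes. The function pick_gene and the keys seq_id, start, and end are assumptions made for illustration; they are not taken from the module, which documents only that "coordinate information" may be supplied.

```perl
use strict;
use warnings;

# Hypothetical selection among candidates that already share a name.
# Prefers an ID match, then overlapping coordinates, then the first match.
sub pick_gene {
    my ( $candidates, %query ) = @_;

    if ( defined $query{ID} ) {
        foreach my $gene (@$candidates) {
            return $gene if $gene->{id} eq $query{ID};
        }
    }

    if ( defined $query{start} and defined $query{end} ) {
        foreach my $gene (@$candidates) {
            next if defined $query{seq_id} and $gene->{seq_id} ne $query{seq_id};
            # two intervals overlap when each starts before the other ends
            return $gene
                if $gene->{start} <= $query{end} and $gene->{end} >= $query{start};
        }
    }

    return $candidates->[0];
}

my @same_name = (
    { id => 'gene1', seq_id => 'chr1', start => 100,  end => 900 },
    { id => 'gene2', seq_id => 'chr2', start => 5000, end => 7000 },
);

my $by_id      = pick_gene( \@same_name, ID => 'gene2' );
my $by_overlap = pick_gene( \@same_name, seq_id => 'chr2', start => 6000, end => 6500 );
my $fallback   = pick_gene( \@same_name );

print join( ' ', $by_id->{id}, $by_overlap->{id}, $fallback->{id} ), "\n";
```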

orphans

This method will return an array of orphan SeqFeature objects, that is, features that named a parent that could not be found. Typically, this is an indication of an incomplete or malformed GFF3 file. Nevertheless, it might be a good idea to check this after retrieving all top features.

comments

This method will return an array of the comment or pragma lines that may have been in the parsed file. These may or may not be useful.

from_gff_string($string)

This method will parse a GFF, GTF, or GFF3 formatted string or line of text and return a Bio::SeqFeature::Lite object.

unescape($text)

This method will unescape special characters in a text string. Certain characters, including ";" and "=", are reserved for GFF3 formatting and must therefore be escaped when they appear in attribute values.
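GFF3 escapes reserved characters using URL-style percent encoding (for example, ";" becomes %3B). A minimal sketch of such decoding follows; the function unescape_gff3 is a hypothetical stand-in and may differ from what this method does internally.

```perl
use strict;
use warnings;

# Hypothetical stand-in: convert each %XX hex escape back to its character.
sub unescape_gff3 {
    my ($text) = @_;
    $text =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
    return $text;
}

print unescape_gff3('kinase%3B putative%2C alpha%3D1'), "\n";
```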

is_coding($transcript)

This method will return a boolean value indicating whether the passed transcript object appears to be a coding transcript. GFF and GTF files are not always immediately clear about the type of transcript; there are (unfortunately) multiple ways to encode the feature as a protein coding transcript: primary_tag, source_tag, attribute, CDS subfeatures, etc. This method tries to determine this.
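To give a sense of the kinds of checks involved, here is a sketch over plain hashes. The function looks_coding, the hash keys, and the specific values tested (mRNA, protein_coding, transcript_biotype) are assumptions chosen for illustration; the method's actual rules may differ.

```perl
use strict;
use warnings;

# Hypothetical heuristic over a plain hash, not a Bio::SeqFeature::Lite object.
sub looks_coding {
    my ($transcript) = @_;

    # primary_tag directly names a coding transcript
    return 1 if ( $transcript->{primary_tag} // '' ) =~ /^mRNA$/i;

    # some GTF files put the biotype in the source column
    return 1 if ( $transcript->{source_tag} // '' ) =~ /protein_coding/i;

    # or in an attribute (attribute name assumed here)
    return 1
        if ( $transcript->{attributes}{transcript_biotype} // '' ) eq 'protein_coding';

    # otherwise look for CDS subfeatures
    return 1
        if grep { $_->{primary_tag} eq 'CDS' } @{ $transcript->{subfeatures} || [] };

    return 0;
}

my $coding = looks_coding(
    { primary_tag => 'transcript', subfeatures => [ { primary_tag => 'CDS' } ] } );
my $noncoding = looks_coding(
    { primary_tag => 'lincRNA', subfeatures => [ { primary_tag => 'exon' } ] } );

print "coding: $coding, noncoding: $noncoding\n";
```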

AUTHOR

 Timothy J. Parnell, PhD
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.