NAME

Bio::ToolBox::parser::ucsc - Parser for UCSC genePred, refFlat, etc. formats

SYNOPSIS

  use Bio::ToolBox::parser::ucsc;
  
  ### A simple transcript parser
  my $ucsc = Bio::ToolBox::parser::ucsc->new('file.genePred');
  
  ### A full-fledged gene parser
  my $ucsc = Bio::ToolBox::parser::ucsc->new(
        file      => 'ensGene.genePred',
        do_gene   => 1,
        do_cds    => 1,
        do_utr    => 1,
        ensname   => 'ensemblToGene.txt',
        enssrc    => 'ensemblSource.txt',
  );
  
  ### Retrieve one transcript line at a time
  my $transcript = $ucsc->next_feature;
  
  ### Retrieve one assembled gene at a time
  my $gene = $ucsc->next_top_feature;
  
  ### Retrieve array of all assembled genes
  my @genes = $ucsc->top_features;
  
  # Each gene or transcript is a SeqFeatureI compatible object
  printf "gene %s is located at %s:%s-%s\n", 
    $gene->display_name, $gene->seq_id, 
    $gene->start, $gene->end;
  
  # Multiple transcripts can be assembled into a gene
  foreach my $transcript ($gene->get_SeqFeatures) {
    # each transcript has exons
    foreach my $exon ($transcript->get_SeqFeatures) {
      printf "exon is %sbp long\n", $exon->length;
    }
  }
  
  # Features can be printed in GFF3 format
  $gene->version(3);
  print STDOUT $gene->gff_string(1); 
   # the 1 indicates to recurse through all subfeatures
  

DESCRIPTION

This is a parser for converting UCSC-style gene prediction flat file formats into BioPerl-style Bio::SeqFeatureI compliant objects, complete with nested objects representing transcripts, exons, CDS, UTRs, and start and stop codons. Full control is available over what to parse, e.g. exons on, CDS and codons off. Additional gene information, such as common gene names and descriptions, can be added by supplying supplemental tables available from the UCSC repository.

Table formats supported

Supported files are tab-delimited text files obtained from UCSC and described at http://genome.ucsc.edu/FAQ/FAQformat.html#format9. Formats are identified by the number of columns, rather than by specific file extensions, column name headers, or other metadata. Therefore, only unmodified tables should be used to ensure correct parsing. Some errors are reported for incorrect lines. Unadulterated files can safely be downloaded from http://hgdownload.soe.ucsc.edu/downloads.html. Files obtained from the UCSC Table Browser can also be used with caution. Files may be gzip compressed.

File formats supported include the following.

  • Gene Prediction (genePred), 10 columns

  • Gene Prediction with RefSeq gene Name (refFlat), 11 columns

  • Extended Gene Prediction (genePredExt), 15 columns

  • Extended Gene Prediction with bin (genePredExt), 16 columns

  • knownGene table, 12 columns
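
Identification by column count can be illustrated with a minimal sketch. This is not the module's own code: the helper name guess_format, the format labels, and the sample line are hypothetical, and only the column counts come from the list above.

```perl
use strict;
use warnings;

# Column counts taken from the list above; the labels are informal.
my %format_for = (
    10 => 'genePred',
    11 => 'refFlat',
    12 => 'knownGene',
    15 => 'genePredExt',
    16 => 'genePredExt with bin',
);

# Hypothetical helper: guess the format of one tab-delimited line.
sub guess_format {
    my ($line) = @_;
    chomp $line;
    # a limit of -1 keeps trailing empty fields so the count is not understated
    my @fields = split /\t/, $line, -1;
    return $format_for{ scalar @fields } || 'unknown';
}

# Made-up 10-column line for illustration only
my $line = join "\t", qw(NM_000001 chr1 + 1000 5000 1200 4800 2),
    '1000,3000,', '1500,5000,';
print guess_format($line), "\n";
```

Because nothing but the column count is checked, adding or removing a column from a table changes how it would be interpreted, which is why unmodified tables are recommended.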

Supplemental information

The UCSC gene prediction tables include essential information, but not detailed information, such as common gene names, description, protein accession IDs, etc. This additional information can be associated with the genes or transcripts during parsing if the appropriate tables are supplied. These tables can be obtained from the UCSC download site http://hgdownload.soe.ucsc.edu/downloads.html.

Supported tables include the following.

  • refSeqStatus, for refGene, knownGene, and xenoRefGene tables

  • refSeqSummary, for refGene, knownGene, and xenoRefGene tables

  • ensemblToGeneName, for ensGene tables

  • ensemblSource, for ensGene tables

  • kgXref, for knownGene tables

Implementation

For an implementation of this module to generate GFF3 formatted files from UCSC data sources, see the Bio::ToolBox script ucsc_table2gff3.pl.

METHODS

Initialize the parser object

new

Initialize a UCSC table parser object. Pass a single value (a table file name) to open a table and parse its objects. Alternatively, pass an array of key => value pairs to control how the table is parsed. Options include the following.

file
table

Provide a file name for a UCSC gene prediction table. The file may be gzip compressed.

source

Pass a string to be added as the source tag value of the SeqFeature objects. The default value is 'UCSC'. If the file name has a recognizable name, such as 'refGene' or 'ensGene', it will be used instead.

do_gene

Pass a boolean (1 or 0) value to combine multiple transcripts with the same gene name under a single gene object. Default is true.

do_exon
do_cds
do_utr
do_codon

Pass a boolean (1 or 0) value to parse certain subfeatures, including exon, CDS, five_prime_UTR, three_prime_UTR, stop_codon, and start_codon features. Default is false.

do_name

Pass a boolean (1 or 0) value to assign names to subfeatures, including exons, CDSs, UTRs, and start and stop codons. Default is false.

share

Pass a boolean (1 or 0) value to recycle shared subfeatures (exons and UTRs) between multiple transcripts of the same gene. This results in reduced memory usage, and smaller exported GFF3 files. Default is true.

refseqsum
refseqstat
kgxref
ensembltogene
ensemblsource

Pass the appropriate file name for additional information.

class

Pass the name of a Bio::SeqFeatureI compliant class that will be used to create the SeqFeature objects. The default is to use Bio::ToolBox::SeqFeature.

Modify the parser object

These methods set or retrieve parameters, and load supplemental files and new tables.

source
do_gene
do_exon
do_cds
do_utr
do_codon
do_name
share

These methods retrieve or set parameters to the parsing engine, same as the options to the new method.

fh

Set or retrieve the file handle of the current table. This module uses IO::Handle objects. Be careful manipulating file handles of open tables!

open_file

Pass the name of a new table to parse. Existing gene models loaded in memory, if any, are discarded. Counts are reset to 0. Supplemental tables are not discarded.

load_extra_data($file, $type)
        my $file = 'hg19_refSeqSummary.txt.gz';
        my $success = $ucsc->load_extra_data($file, 'summary');

Pass two values: the file name of the supplemental file and the type of supplemental data. Type values can include the following.

  • refseqstatus or status

  • refseqsummary or summary

  • kgxref

  • ensembltogene or ensname

  • ensemblsource or enssrc

The number of transcripts with information loaded from the supplemental data file is returned.

Feature retrieval

The following methods parse the table lines into SeqFeature objects. It is best if methods are not mixed; unexpected results may occur.

next_feature

This will read the next line of the table and parse it into a gene or transcript object. However, multiple transcripts from the same gene are not assembled together under the same gene object.

next_top_feature

This method returns the next top feature (typically a gene), with multiple transcripts of the same gene assembled under the same gene object. Transcripts are assembled together if they share the same gene name and the transcripts overlap. If transcripts share the same gene name but do not overlap, they are placed into separate gene objects with the same name but different primary_id tags. Calling this method will parse the entire table into memory (so that multiple transcripts may be assembled), but only one object is returned at a time. Call this method repeatedly using a while loop to get all features.
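
The assembly rule (same gene name and overlapping coordinates) can be sketched in plain Perl. This is a simplified illustration, not the module's actual algorithm: the record layout, the assemble_genes helper, and the sample coordinates are all hypothetical, and the sketch additionally assumes that transcripts must lie on the same chromosome.

```perl
use strict;
use warnings;

# Hypothetical transcript records: [gene name, chromosome, start, end]
my @transcripts = (
    [ 'ABC1', 'chr1', 1000,  5000 ],
    [ 'ABC1', 'chr1', 1200,  5200 ],
    [ 'ABC1', 'chr1', 90000, 95000 ],    # same name, but no overlap
    [ 'XYZ2', 'chr2', 300,   900 ],
);

# Group transcripts into genes: same name, same chromosome, overlapping span.
sub assemble_genes {
    my @sorted = sort {
               $a->[0] cmp $b->[0]
            or $a->[1] cmp $b->[1]
            or $a->[2] <=> $b->[2]
    } @_;
    my @genes;
    foreach my $t (@sorted) {
        my ( $name, $chr, $start, $end ) = @$t;
        my $last = $genes[-1];
        if (    $last
            and $last->{name} eq $name
            and $last->{chr} eq $chr
            and $start <= $last->{end} )
        {
            # overlaps the gene being built: extend it and attach the transcript
            $last->{end} = $end if $end > $last->{end};
            push @{ $last->{transcripts} }, $t;
        }
        else {
            # new name, new chromosome, or no overlap: start a separate gene
            push @genes,
                {
                name        => $name,
                chr         => $chr,
                start       => $start,
                end         => $end,
                transcripts => [$t],
                };
        }
    }
    return @genes;
}

foreach my $gene ( assemble_genes(@transcripts) ) {
    printf "%s %s:%d-%d with %d transcript(s)\n",
        $gene->{name}, $gene->{chr}, $gene->{start}, $gene->{end},
        scalar @{ $gene->{transcripts} };
}
```

In this sketch the two overlapping ABC1 transcripts end up in one gene, while the distant ABC1 transcript becomes a second gene with the same name.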

top_features

This method is similar to "next_top_feature", but instead returns an array of all the top features.

Other methods

Additional methods for working with the parser object and the parsed SeqFeature objects.

parse_table

Parses the table into memory. If a table wasn't provided using the "new" or "open_file" methods, then a filename can be passed to this method and it will automatically be opened for you.

find_gene
        my $gene = $ucsc->find_gene(
                display_name => 'ABC1',
                primary_id   => 'gene000001',
        );

Pass a gene name, or an array of key => value pairs (name, display_name, ID, primary_ID, and/or coordinate information), that can be used to find a gene already loaded into memory. This works reliably only if the entire table has been loaded into memory. Genes with a matching name are confirmed by a matching ID or overlapping coordinates, if available. Otherwise the first match is returned.

counts

This method will return a hash of the number of genes and RNA types that have been parsed.

typelist

This method will return a comma-delimited list of the feature types or primary_tags found in the parsed file. Returns a generic list if a file has not been parsed.

from_ucsc_string

A bare bones method that will convert a tab-delimited text line from a UCSC formatted gene table into a SeqFeature object for you. Don't expect alternate transcripts to be assembled into genes.
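
For orientation, the following sketch shows what converting one 10-column genePred line involves. It does not reproduce the internals of from_ucsc_string and builds a plain hash rather than a SeqFeature object; the helper name parse_genepred_line and the sample line are hypothetical. It assumes the standard UCSC genePred column order and the UCSC convention of 0-based start coordinates, which are shifted here to 1-based starts.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one 10-column genePred line into a plain hash.
sub parse_genepred_line {
    my ($line) = @_;
    chomp $line;
    my ( $name, $chrom, $strand, $tx_start, $tx_end,
         $cds_start, $cds_end, $exon_count, $exon_starts, $exon_ends )
        = split /\t/, $line;

    # exon coordinates are comma-delimited lists with a trailing comma
    my @starts = split /,/, $exon_starts;
    my @ends   = split /,/, $exon_ends;
    die "exon count mismatch for $name\n"
        unless @starts == $exon_count and @ends == $exon_count;

    # UCSC starts are 0-based; add 1 for 1-based feature coordinates
    my @exons = map { [ $starts[$_] + 1, $ends[$_] ] } 0 .. $#starts;

    return {
        name   => $name,
        seq_id => $chrom,
        strand => $strand,
        start  => $tx_start + 1,
        end    => $tx_end,
        exons  => \@exons,
    };
}

# Made-up line for illustration only
my $line = join "\t", qw(NM_000001 chr1 + 1000 5000 1200 4800 2),
    '1000,3000,', '1500,5000,';
my $tx = parse_genepred_line($line);
printf "%s %s:%d-%d (%d exons)\n",
    $tx->{name}, $tx->{seq_id}, $tx->{start}, $tx->{end},
    scalar @{ $tx->{exons} };
```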

seq_ids

Returns an array or array reference of the names of the chromosomes or reference sequences present in the table.

seq_id_lengths

Returns a hash reference to the chromosomes or reference sequences and their corresponding lengths. In this case, the length is inferred by the greatest gene end position.
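
Inferring a length from the greatest end position can be sketched as follows. This is only an illustration of the idea, not the module's code: the helper infer_lengths and the sample lines are hypothetical, and the sketch assumes genePred column order (chromosome in the second column, transcript end in the fifth).

```perl
use strict;
use warnings;

# Hypothetical helper: record the greatest transcript end seen per chromosome.
sub infer_lengths {
    my %length;
    foreach my $line (@_) {
        chomp $line;
        my @fields = split /\t/, $line;
        my ( $chrom, $end ) = @fields[ 1, 4 ];
        $length{$chrom} = $end
            if !exists $length{$chrom} or $end > $length{$chrom};
    }
    return \%length;
}

# Made-up genePred lines for illustration only
my @lines = (
    join( "\t", qw(tx1 chr1 + 1000 5000 1200 4800 1 1000, 5000,) ),
    join( "\t", qw(tx2 chr1 - 7000 9500 7100 9400 1 7000, 9500,) ),
    join( "\t", qw(tx3 chr2 + 200 800 250 750 1 200, 800,) ),
);
my $lengths = infer_lengths(@lines);
printf "%s\t%d\n", $_, $lengths->{$_} for sort keys %$lengths;
```

A length obtained this way is only a lower bound on the true chromosome length, since sequence beyond the last annotated gene is not counted.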

Bio::ToolBox::parser::ucsc::builder

This is a private module that is responsible for building SeqFeature objects from UCSC table lines. It is not intended for general public use.

SEE ALSO

Bio::ToolBox::SeqFeature, Bio::ToolBox::parser::gff

AUTHOR

 Timothy J. Parnell, PhD
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.