The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Bio::DB::USeq - Read USeq archive database files

SYNOPSIS

    use Bio::DB::USeq;
    my $useq = Bio::DB::USeq->new('file.useq') or 
        die "unable to open file.useq!\n";
    
    # sequence IDs
    my @seq_ids = $useq->seq_ids;
    my $length = $useq->length($chr); # approximate, not exact
    
    ### Retrieving features
    # all features or observations across chromosome I
    my @features = $useq->features(
        -seq_id     => 'chrI',
        -type       => 'region', 
    );
    
    # same thing using a simple form
    # use an array of (chromosome, start, stop, strand)
    my @features = $useq->features('chrI');
    
    # same thing with a memory efficient iterator
    my $iterator = $useq->get_seq_stream(
        -seq_id     => 'chrI',
    );
    while (my $f = $iterator->next_seq) {
        # each feature $f supports most SeqFeatureI methods
    }
    
    
    ### Retrieving simple scores
    my @scores = $useq->scores(
        -seq_id     => 'chrI',
        -start      => 1000,
        -end        => 2000
    );
    
    
    ### Same methods as above after defining an interval first
    my $segment = $useq->segment(
        -seq_id     => 'chrI',
        -start      => 1000,
        -end        => 2000,
    );
    my @scores   = $segment->scores;
    my @features = $segment->features;
    my $iterator = $segment->get_seq_stream;
    
    
    ### Efficient retrieval of positioned scores in 100 bins 
    # compatible with Bio::Graphics 
    my ($wig) = $useq->features(
        # assuming unstranded data here, otherwise two wiggle objects 
        # would be returned, one for each strand
        -seq_id     => 'chrI',
        -start      => 1000,
        -end        => 2000,
        -type       => 'wiggle:100',
    );
    my @bins = $wig->wiggle;
    my @bins = $wig->coverage; # same thing
    my ($bins) = $wig->get_tag_values('coverage'); # same thing
    
    
    ### Statistical summaries of intervals
    # compatible with Bio::Graphics
    my ($summary) = $useq->features(
        # assuming unstranded data here, otherwise two summaries 
        # would be returned, one for each strand
        -seq_id     => 'chrI',
        -start      => 1000,
        -end        => 2000,
        -type       => 'summary',
    );
    my $stats = $summary->statistical_summary(10);
    foreach (@$stats) {
        my $max = $_->{maxVal};
        my $mean = Bio::DB::USeq::binMean($_);
    }
    
    
    ### Stranded data using an iterator
    # could be used with either wiggle or summary features
    my $stream = $useq->get_seq_stream(
        -seq_id     => 'chrI',
        -start      => 1000,
        -end        => 2000,
        -type       => 'wiggle:100',
    );
    my ($forward, $reverse);
    if ($useq->stranded) {
        $forward = $stream->next_seq;
        $reverse = $stream->next_seq;
    }
    else {
        $forward = $stream->next_seq;
    }
    

DESCRIPTION

Bio::DB::USeq is a BioPerl style adaptor for reading USeq files. USeq files are compressed, indexed data files supporting modern bioinformatic datasets, including genomic points, scores, and intervals. More information about the USeq software package can be found at https://github.com/HuntsmanCancerInstitute/USeq, including a description of the USeq data archive format.

USeq files are typically half the size of corresponding bigBed and bigWig files, due to a compact internal format and lack of internal zoom data. This adaptor, however, can still return statistics across different zoom levels in the same manner as big files, albeit at a cost of calculating these in realtime.

Generating useq files

USeq files may be generated using tools from the USeq package, available at https://github.com/HuntsmanCancerInstitute/USeq. They may be generated from native Bar files, text Wig files, text Bed files, and UCSC bigWig and bigBed file formats.

Compatibility

The adaptor follows most conventions of other BioPerl-style Bio::DB adaptors. Observations or features in the useq file archive are returned as SeqFeatureI compatible objects.

Coordinates consumed and returned by the adaptor are 1-based, consistent with BioPerl convention. This is not true of the useq file itself, which uses the interbase coordinate system.

Unlike wig and bigWig files, useq file archives support stranded data, which can make data collection much simpler for complex experiments.

See below for GBrowse compatibility.

Limitations

This adaptor is read only. USeq files, in general, are not modified or written with data. The exceptions are chromosome or global statistics are written to the archiveReadMe.txt file inside the zip archive to cache for future queries.

No support for genomic sequence is included. Users who need access to genomic sequence should seek an alternative BioPerl adaptor, such as Bio::DB::Fasta.

Useq files do not have a native concept of type, primary_tag, or source attributes, as expected with GFF-based database adaptors. The features method does support special types (see below).

Requirements

The adaptor is a Perl-only implementation. It only requires the Archive::Zip module for opening and reading the useq file archive. BioPerl is required for working with SeqFeatures objects generated from useq file observations.

METHODS

Initializing the Bio::DB::USeq object

These are class methods for creating and working with the Bio::DB::USeq object itself.

new
new($file)
new(-useq => $file)

This will create a new Bio::DB::USeq object. Optionally pass the path to a useq file archive to open immediately. There is not much to do unless you open a file.

Named arguments may be used to specify the file. Either -useq or -file may be used.

open($file)

Open a useq file archive. Useq files typically use the .useq file extension. Returns true if successful.

DO NOT open a subsequent useq file archive when one has already been opened. Please create a new object to open a new file.

clone

Force the object to re-open the useq archive file under a new file handle to make it clone safe. Do this in the child process before collecting data.

zip

Return the Archive::Zip object representing the useq file archive. Generally not recommended unless you know what you are doing.

General information about the useq file

These are class methods for obtaining general information or metadata attributes regarding the contents of the file.

stranded

Return true or false (1 or 0) indicating whether the contents of the useq file archive are recorded in a stranded fashion.

attributes

Return an array of available attribute keys that were recorded in the useq file archiveReadMe.txt member. These are key = value pairs representing metadata for the useq file.

attribute($key)

Return the metadata attribute value for the specified key. These are recorded in the useq file archiveReadMe.txt member. Returns undefined if the key does not exist.

type

Return the useq file metadata dataType value.

genome

Return the useq file metadata versionedGenome value.

version

Return the useq file metadata useqArchiveVersion value.

Chromosome Information

These are methods to obtain information about the chromosomes or reference sequences represented in the file archive.

Note that generating score statistics across one or more chromosomes may be computationally expensive. Therefore, chromosome statistics, once calculated, are cached in the useq file metadata for future reference. This necessitates writing to the useq zip archive. This is currently the only supported method for modifying the useq zip archive.

seq_ids

Return an array of the chromosome or sequence identifiers represented in the useq file archive. The names are sorted ASCIIbetically before they are returned.

length($seq_id)

Return the length of the provided chromosome or sequence identifier. Note that this may not be the actual length in the reference genome, but rather the last recorded position of an observation for that chromosome. Hence, it should be used only as an approximation.

chr_mean($seq_id)

Return the mean score value across the entire chromosome.

chr_stdev($seq_id)

Return the score standard deviation across the entire chromosome.

chr_stats($seq_id)

Return a statistical summary across the entire chromosome. This is an anonymous hash with five keys: validCount, sumData, sumSquares, minVal, and maxVal. These are equivalent to the statistical summaries generated by the Bio::DB::BigWig adaptor.

global_mean

Return the mean score value across all chromosomes.

global_stdev

Return the mean score value across all chromosomes.

global_stats

Return a statistical summary across all chromosomes.

Data accession

These are the primary methods for working with data contained in useq file archive. These should be familiar to most BioPerl users.

features

Returns an array or array reference of SeqFeatureI compatible objects overlapping a given genomic interval.

Coordinates of the interrogated regions must be supplied. At a minimum, the seq_id must be supplied. A start of 1 and an end corresponding to the length of the seq_id is used if not directly provided. Coordinates may be specified in two different manners, either as a list of (seq_id, start, end, strand) or as one or more keyed values.

    my @features = $useq->features($seq_id, $start, $end);
    
    @features = $useq->features(
        -seq_id     => $seq_id,
        -start      => $start,
        -end        => $end,
    );

If the -iterator argument is supplied with a true value, then an iterator object is returned. See get_seq_stream() for details.

Bio::DB::USeq supports four different feature types. Feature types may be specified using the -type argument.

region
interval
observation

The default feature type if the type argument is not specified. These are SeqFeatureI compatible objects representing observations in the useq file archive. These are compatible with the iterator. SeqFeature observations are returned in genomic order.

chromosome

Returns SeqFeatureI compatible objects representing the reference sequences (chromosomes) listed in the useq file archive. These are not compatibile with the iterator.

wiggle
wiggle:$bins
coverage
coverage:$bins

Returns an array of SeqFeatureI compatible objects for each strand requested. If the useq file contains stranded data, and no strand is requested, then two objects will be returned representing each strand.

Each object contains an array representing scores across the requested coordinates. This object is designed to be backwards compatible with coverage features from the Bio::DB::Sam adaptor for use with Bio::Graphics and GBrowse. Note that only scores are returned, not a true depth coverage in the sense of the Bio::DB::Sam coverage types.

By default, the wiggle or coverage array is provided at 1 bp resolution. To improve efficiency with large regions, the wiggle array may be limited by using a bin option, where the interrogated interval is divided into the number of bins requested.

To retrieve the scores, call the wiggle() or coverage() method.

For example, to request wiggle scores in 100 equal bins across the interval, see the following example. The wiggle and coverage types are synonymous.

  my ($wiggle) = $useq->features(
      -seq_id       => $chromosome,
      -start        => $start,
      -end          => $end,
      -type         => 'wiggle:100',
  );
  my @scores = $wiggle->wiggle;
  @scores    = $wiggle->coverage;

Wiggle objects may also be obtained with a get_seq_stream() iterator objects.

summary
summary:$bins

Returns an array of SeqFeatureI compatibile Summary objects for each strand requested. If the useq file contains stranded data, and no strand is requested, then two objects will be returned representing each strand.

Each Summary object can then be used to call statistical summaries for one or more bins across the interval. Each statistical summary is an anonymous hash with five keys: validCount, sumData, sumSquares, minVal, and maxVal. From these values, a mean and standard deviation may also be calculated.

  my ($summary) = $useq->features(
      -seq_id       => $chromosome,
      -start        => $start,
      -end          => $end,
      -type         => 'summary',
  );
  my @stats  = $summary->statistical_summary(100);
  foreach my $stat (@stats) {
         my $count = $stat->{validCount};
         my $sum   = $stat->{sumData};
         my $mean  = $sum / $count;
  }

Statistical summaries are equivalent to those generated by the Bio::DB::BigWig adaptor and may be used interchangeably. They are compatible with the Bio::Graphics modules.

Summary objects may also be obtained with a get_seq_stream() iterator object.

get_seq_stream

This is a memory efficient data accessor. An iterator object is returned for an interval specified using coordinate values in the same manner as features(). Call the method next_seq() on the iterator object to retrieve the observation SeqFeature objects one at a time. The iterator is compatible with region, wiggle, and summary feature types.

    # establish the iterator object
    my $iterator = $useq->get_seq_stream(
        -seq_id     => $seq_id,
        -start      => $start,
        -end        => $end,
        -type       => 'region',
    );
    
    # retrieve the features one at a time
    while (my $f = $iterator->next_seq) {
        # each feature $f is either a 
        # Bio::DB::USeq::Feature, 
        # a Bio::DB::USeq::Wiggle, or 
        # a Bio::DB::USeq::Summary object 
    }
scores

This is a simplified data accessor that only returns the score values overlapping an interrogated interval, rather than assembling each observation into a SeqFeature object. The scores are not associated with genomic coordinates, and are not guaranteed to be returned in original genomic order.

Provide the interval coordinates in the same manner as the features() method. Stranded data collection is supported.

    my @scores = $useq->scores(
        -seq_id     => $seq_id,
        -start      => $start,
        -end        => $end,
    );
observations

This is a low-level data accessor, similar to features() but returning an array of references to the original USeq observations. Each USeq observation is an anonymous array reference of 2-4 elements: start, stop, score, and text, depending on the stored data type. All coordinates are in 0-based coordinates, unlike high level accessors. Note that while strand is supported, it is not reported for each observation. If observations exist on both strands, then each strand should be searched separately. Observations are not guaranteed to be returned in genomic order.

    my @observations = $useq->observations(
        -seq_id     => $seq_id,
        -start      => $start,
        -end        => $end,
    );
    foreach my $obs (@observations) {
        my $start = $obs->[0] + 1; # convert to 1-based coordinate
        my $stop  = $obs->[1];
        my $score = $obs->[2]; # may not be present
        my $text  = $obs->[3]; # may not be present
    }
segment

This returns a Bio::DB::USeq::Segment object, which is a SeqFeatureI compatible segment object corresponding to the specified coordinates. From this object, one can call the features(), scores(), or get_seq_stream() methods directly. Keyed options or location information need not be provided. Stranded segments are supported.

    my $segment = $useq->segment(
        -seq_id     => $seq_id,
        -start      => $start,
        -end        => $end,
    );
    my @scores   = $segment->scores;
    my @features = $segment->features('wiggle:100');
    my $iterator = $segment->get_seq_stream('region');
get_features_by_location

Convenience method for returning features restricted by location.

get_feature_by_id
get_feature_by_name

Compatibility methods for returning a specific feature or observation in the USeq file. Text fields, if present, are not indexed in the USeq file, preventing efficient searching of names. As a workaround, an ID or name comprised of "$seq_id:$start:$end" may be used, although a direct search of coordinates would be more efficient.

        my $feature = $useq->get_feature_by_id("$seq_id:$start:$end");
        

ADDITIONAL CLASSES

These are additional class object that may be returned by various methods above.

Bio::DB::USeq::Feature objects

These are SeqFeatureI compliant objects returned by the features() or next_seq() methods. They support the following methods.

    seq_id
    start
    end
    strand
    score
    type 
    source (returns the useq archive filename)
    name (chromosome:start..stop)
    Bio::RangeI methods

Additionally, chromosome and global statistics are also available from any feature, as well as from Segment, Wiggle, Iterator, and Summary objects. See the corresponding USeq methods for details.

Bio::DB::USeq::Segment objects

This is a SeqFeatureI compliant object representing a genomic segment or interval. These support the following methods.

features
features($type)
get_seq_stream
get_seq_stream($type)
scores
observations

Direct methods for returning features or scores. Coordinate information need not be provided. See the corresponding Bio::DB::USeq methods for more information.

wiggle
wiggle($bins)
coverage
coverage($bins)
statistical_summary
statistical_summary($bins)

Convenience methods for returning wiggle (coverage) or summary features over the segment. If desired, the number of bins may be specified. See the features() method for more information.

slices

Returns an array of splice member names that overlap this segment. See "USEQ SLICES" for more information.

Bio::DB::USeq::Iterator objects

This is an iterator object for retrieving useq observation SeqFeatureI objects in a memory efficient manner by returning one feature at a time.

next_seq
next_feature

Returns the next feature present in the interrogated interval. Features are generally returned in ascending coordinate order. Returns undefined when no features are remaining in the interval. Features may include either region or wiggle types, depending on how the iterator object was established. See features() and get_seq_stream() methods for more information.

Bio::DB::USeq::Wiggle objects

These are SeqFeatureI compliant objects for backwards compatibility with Bio::Graphics and GBrowse. They support the wiggle() and coverage() methods, which returns an array of scores over the interrogated region. By default, the array is equal to the length of the region (1 bp resolution), or may be limited to a specified number of bins for efficiency. See the features() method for more information.

wiggle
coverage

The scores are stored as an array in the coverage attribute. For convenience, the wiggle() and coverage() methods may be used to retrieve the array or array reference of scores.

statistical_summary

Generate a statistical summary hash for the collected wiggle scores (not the original data). This method is not entirely that useful; best to use the summary feature type in the first place.

chromosome and global statistics

Chromosome and global statistics, including mean and standard deviation, are available from wiggle objects. See the corresponding USeq methods for details.

Bio::DB::USeq::Summary objects

These are SeqFeatureI compliant Summary objects, similar to those generated by the Bio::DB::BigWig database adaptor. As such, they are compatible with Bio::Graphics and GBrowse.

Summary objects can generate statistical summaries over a specified number of bins (default is 1 bin, or the entire requested region). Each statistical summary is an anonymous hash consisting of five keys: validCount, sumData, sumSquares, minVal, and maxVal. From these values, a mean and standard deviation may be calculated.

For convenience, three exported functions are available for calculating the mean and standard deviation from a statistical summary hash. See "EXPORTED FUNCTIONS" for more information.

Use statistical summaries in the following manner.

    my $stats = $summary->statistical_summary(10);
    my $stat  = shift @$stats;
    my $min   = $stat->{minVal};
    my $max   = $stat->{maxVal};
    my $mean  = $stat->{sumData} / $stat->{validCount};
statistical_summary
statistical_summary($bins)

Generate a statistical summary hash for one or more bins across the interrogated region. Provide the number of bins desired. If a feature type of "summary:$bins" is requested through the features() or get_seq_stream() method, then $bins number of bins will be used. The default number of bins is 1.

score

Generate a single statistical summary over the entire region.

chromosome and global statistics

Chromosome and global statistics, including mean and standard deviation, are available from summary objects. See the corresponding USeq methods for details.

EXPORTED FUNCTIONS

Three subroutine functions are available for export to assist in calculating the mean, variance, and standard deviation from statistical summaries. These functions are identical to those from the Bio::DB::BigWig adaptor and may be used interchangeably.

They are not exported by default; they must explicitly listed.

    use Bio::DB::USeq qw(binMean binStdev);
    my $stats = $summary->statistical_summary(10);
    my $stat  = shift @$stats;
    my $mean  = binMean($stat);
    my $stdev = binStdev($stat);
binMean( $stat )

Calculate the mean from a statistical summary anonymous hash.

binVariance( $stat )

Calculate the variance from a statistical summary anonymous hash.

binStdev( $stat )

Calculate the standard deviation from a statistical summary anonymous hash.

USEQ SLICES

Genomic observations are recorded in groups, called slices, of usually 10000 observations at a time. Each slice is a separate zip file member in the useq file archive. These methods are for accessing information about each slice. In general, accessing data through slices is a lower level operation. Users should preferentially use the main data accessors.

The following are Bio::DB::USeq methods available for working with slices.

slices

Returns an array of all the slice member names present in the useq archive file.

slice_feature($slice)

Return a Bio::DB::USeq::Segment object representing the slice interval. The features(), get_seq_stream(), and scores() methods are supported.

slice_seq_id($slice)

Return the chromosome or sequence identifier associated with a slice.

slice_start($slice)

Return the start position of the slice interval.

slice_end($slice)

Return the end position of the slice interval.

slice_strand($slice)

Return the strand of the slice interval.

slice_type($slice)

Return the file type of the slice member. This corresponds to the file extension of the slice zip member and indicates how to parse the binary member. Each letter in the type corresponds to a data type, be it integer, short, floating-point, or text. See the useq file documentation for more details.

slice_obs_number($slice)

Return the number of observations recorded in the slice interval.

GBROWSE COMPATIBILITY

The USeq adaptor has support for Bio::Graphics and GBrowse. It will work with the segments glyph for intervals, the wiggle_xyplot glyph for displaying mean scores, and the wiggle_whiskers glyph for displaying detailed statistics.

Initialize the USeq database adaptor.

    [data1:database]
    db_adaptor    = Bio::DB::USeq
    db_args       = -file /path/to/data1.useq

Displaying simple intervals with the segments glyph.

    [data1_segments]
    database      = data1
    feature       = region
    glyph         = segments
    stranded      = 1

Displaying scores using the wiggle_xyplot glyph. You may set the bins to whatever number is appropriate (in this example, 1000), or leave blank (not recommended, defaults to 1 bp resolution).

    [data1_xyplot]
    database      = data1
    feature       = wiggle:1000
    glyph         = wiggle_xyplot
    graph_type    = histogram
    autoscale     = chromosome

Displaying scores using the wiggle_whiskers glyph. Note that generating statistical summaries are computationally more expensive than simple bins of mean values as with the wiggle feature type.

    [data1_whiskers]
    database      = data1
    feature       = summary
    glyph         = wiggle_whiskers
    graph_type    = histogram
    autoscale     = chromosome

PERFORMANCE

Because the Bio::DB::USeq is implemented as a Perl-only module, performance is subject to the limitations of Perl execution itself and the size of the data that needs to be parsed. In general when collecting score data, requesting scores is the fastest mode of operation, followed by wiggle feature types, and finally summary feature types.

In comparison to UCSC bigWig files, the USeq format is typically much faster when viewing intervals where the entire interval is represented by one or a few internal slices. This is especially true for repeated queries over the same or neighboring intervals, as the slice contents are retained in memory. As the number of internal slices that must be loaded into memory increases, for example querying intervals of hundreds of kilobases in size, performance will begin to lag as each internal slice must be parsed into memory. This is where the UCSC bigWig file format with internal zoom levels of summary statistics can outperform, at the cost of file complexity and size.

AUTHOR

 Timothy J. Parnell, PhD
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.