The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Bio::ToolBox::db_helper

DESCRIPTION

These are helper subroutines to work with relevant databases that can be accessed through BioPerl modules. These include the Bio::DB::SeqFeature::Store relational database as well as Bio::DB::Fasta, Bio::DB::Bam, Bio::DB::BigWig, Bio::DB::BigWigSet, and Bio::DB::BigBed databases. The primary functions included opening connections to these databases, verifying features found within the database, generating lists of features or genomic intervals for subsequent analysis, and collecting features and/or scores within regions of interest. Collected scores may be summarized using a variety of statistical methods.

When collecting scores or features, the data may hosted in a variety of formats and locations. These include:

SeqFeature::Store Database

Full features may be stored, including genes, transcripts, exons, etc. Simple datasets, such as from microarray, may also be stored as the score value in the source GFF file.

References to local, binary, indexed files may also be included as attributes to features stored in the database. Supported files include binary wig files (.wib, see Bio::Graphics::Wiggle) using the 'wigfile' attribute, or bigWig files using the 'wigfile' or 'bigwigfile' attribute. The attribute values must be full paths.

SeqFeature::Store databases are usually hosted by a relational database server (MySQL or PostGreSQL), SQLite file, or an in-memory database (for small GFF3 files only).

BigWig files

BigWig files are compressed, binary, indexed versions of text wig files and may be accessed either locally or remotely. They support extremely fast score retrieval from any genomic location of any size without sacrificing resolution (spatial and numeric).

Directory of BigWig files

A directory containing one or more BigWig files is assembled into a BigWigSet, allowing for metadata, such as strand, to be associated with BigWig files.

BigBed files

BigBed files are compressed, binary, indexed versions of text BED files and may be accessed either locally or remotely. They support extremely fast score and feature retrieval from any genomic location.

Bam files

Bam files are compressed, binary, indexed versions of the text SAM file, or sequence alignment map. They are used with next generation sequencing technologies. They support individual alignment retrieval as well as read depth coverage.

USeq files

USeq files are compressed, binary, indexed files that support BED type annotations or wig type scores distributed across the genome. They support rapid, random access across the genome and are comparable to both BigWig and BigBed files.

While the functions within this library may appear to be simply a rehashing of the methods and functions in Bio::DB::SeqFeature::Store and other modules, they either provide a simpler function to often used database methodologies or are designed to work intimately with the BioToolBox data format file and data structures (see Bio::ToolBox::file_helper). One key advantage to these functions is the ability to work with datasets that are stranded (transcriptome data, for example).

Historically, this module was initially written to use Bio::DB::GFF for database usage. It has since been re-written to use Bio::DB::SeqFeature::Store database schema. Additional functionality has also been added for other databases and file formats.

Complete usage and examples for the functions are provided below.

USAGE

Call the module at the beginning of your perl script and include the module names to export. None are exported by default.

  use Bio::ToolBox::db_helper qw(
          get_new_feature_list 
          get_feature
  );

This will export the indicated subroutine names into the current namespace. Their usage is detailed below.

EXPORTED SUBROUTINES

open_db_connection($database, $no_cache)

This module will open a connection to a BioPerl style database. It returns an object that represents the connection. Several different types of databases are supported.

Bio::DB::SeqFeature::Store database

These may be represented by a relational database (e.g. MySQL database), a SQLite database file (file.sqlite or file.db), or a single GFF3 file (file.gff) that can be loaded into an in-memory database. In-memory databases should only be used with small files as they demand a lot of memory.

Parameters for connecting to a relational database are stored in the BioToolBox configuration file, .biotoolbox.cfg. These include database adaptors, user name, password, etc. Information regarding the configuration file may be found within the file itself.

Bio::DB::Sam database

A self-contained database represented by a sorted, indexed Bam file (file.bam). See http://samtools.sourceforge.net for more details. Files may be either local or remote (prefixed with http:// or ftp://).

Bio::DB::BigWig database

A self-contained database of scores represented by a BigWig (file.bw). See http://genome.ucsc.edu/goldenPath/help/bigWig.html for more information. Files may be either local or remote (prefixed with http:// or ftp://).

Bio::DB::BigWigSet database

A local or remote directory of one or more BigWig files that can treated collectively as a database. A special text file may be included in the directory to assign metadata to each BigWig file, including attributes such as type, source, display name, strand, etc. See Bio::DB::BigWigSet for more information on the formatting of the metadata file.

Bio::DB::BigBed database

A self-contained database of regions represented by a BigBed (file.bb). See http://genome.ucsc.edu/goldenPath/help/bigBed.html for more information. Files may be either local or remote (prefixed with http:// or ftp://).

Bio::DB::USeq database

A self-contained database file of indexed regions or scores represented by a useq archive file (file.useq). See http://useq.sourceforge.net/useqArchiveFormat.html for more information. Files may only be local.

Bio::DB::Fasta Database

A database of fasta sequences. A single multi-fasta or a directory of fasta files may be specified. The directory or parent directory must be writeable by the user to write a small index file.

Pass the name of a relational database or the path of the database file to the subroutine. The opened database object is returned. If it fails, then an error message should be generated and nothing is returned.

Important! If you are forking your perl process, always re-open your database objects in the child process, and pass a second true value to avoid using the cached database object. By default, opened databases are cached to improve efficiency, but this will be disastrous when crossing forks.

Example:

        my $db_name = 'cerevisiae';
        my $db = open_db_connection($db_name);
        
        my $file = '/path/to/file.bam';
        my $db = open_db_connection($file);
        
        # within a forked child process
        # reusing the same variable to simplify code
        # pass second true value to avoid cache
        $db = open_db_connection($file, 1); 
get_db_name

This subroutine will attempt to get the name of an opened Database object if for some reason it's unknown, i.e. user only provided an opened db object. Only works for some databases, at least those I've bothered to track down and find a usable API call to use.

get_dataset_list

This subroutine will retrieve a list of the available features stored in the database and returns an array of the features' types.

For Bio::DB::SeqFeature::Store databases, the type is represented as "type:source", corresponding to the third and second GFF columns, respectively. The types are sorted alphabetically first by source, then by method.

For Bio::DB::BigWigSet databases, the type, primary_tag, method, or display_name attribute may be used, in that respective order of availability. The list is sorted alphabetically.

Pass either the name of the database or an established database object. Supported databases include both Bio::DB::SeqFeature::Store and Bio::DB::BigWigSet databases.

Example:

        my $db_name = 'cerevisiae';
        my @types = get_dataset_list($db_name);
verify_or_request_feature_types()

This subroutine will process a list of feature types or data sources to be used for data or feature collection. There are two modes of action. If a list was provided, it will process and verify the list. If no list was provided, then the user will be prompted to select from a list of available feature types in the provided database.

The provided list may be a list of feature types or a list of single file data sources, including Bam, bigWig, or bigBed files. If the list includes feature types, they are verified as existing in the provided database. If the list includes file names, the local files are checked and the names prefixed with "file:" for use downstream.

If no list was provided, then a list of available feature types in the provided database will be presented to the user, and the user prompted to make a selection. One or more types may be selected, and a single item may be enforced if requested. The response is filtered through the parse_list() method from Bio::ToolBox::utility, so a mix of single numbers or a range of numbers may be accepted. The responses are then validated.

Two types of databases are supported: Bio::DB::SeqFeature::Store and Bio::DB::BigWigSet databases. For SeqFeature Store databases, the type is comprised of "method:source", corresponding to the third and second GFF columns. For BigWigSet databases, types correspond to either the type, method, primary_tag, or display_name attributes, in that order of availability.

For feature types or files that pass validation, they are returned as a list. Types of files that do not pass validation are printed to STDOUT.

To use this subroutine, pass an array with the following keys and values. Not every key is required.

  db       => The name of the database or a reference to an 
              established BioPerl database object. A 
              Bio::DB::SeqFeature::Store or Bio::DB::BigWigSet 
              database are typically used. Only required when 
              no features are provided.
  feature  => Pass either a single dataset name as a scalar or an 
              anonymous array reference of a list of dataset names. 
              These may have been provided as a command line option and 
              need to be verified. If nothing is passed, then a list of 
              possible datasets will be presented to the user to be 
              chosen.
  prompt   => Provide a phrase to be prompted to the user to help in 
              selecting datasets from the list. If none is provided, a 
              generic prompt will be used.
  single   => A Boolean value (1 or 0) indicating whether only a single 
              dataset is allowed when selecting datasets from a 
              presented list. If true, only one dataset choice is 
              accepted. If false, one or more dataset choices are 
              allowed. Default is false.
  limit    => Optionally provide a word or regular expression that matches
              the feature type (primary_tag only; source_tag, if present, 
              is ignored). For example, provide "gene|mrna" to only 
              present gene and mRNA features to the user. This is only 
              applicable when a user must select from a database list. 
              The default is to list all available feature types.

The subroutine will return a list of the accepted datasets. It will print bad dataset names to standard out.

Example:

        # verify a dataset, either file or database type
        my $db_name = 'cerevisiae';
        my $dataset = 'microaray_data';
        my @features = verify_or_request_feature_types(
                db      => $db_name,
                feature => $dataset,
        );
        unless (@features) {
                die " no valid feature provided!\n";
        }
        
        # select a list of features
        my $db_name = 'cerevisiae';
        my @types = (); # user not provided
        @types = verify_or_request_feature_types(
                db      => $db_name,
                feature => $types,
        );
        # user will be promoted to select from database list
        if (@types) {
                print "using types " . join(", ", @types);
        }
        else {
                die " no valid features selected!\n";
        }
        
check_dataset_for_rpm_support()

This subroutine will check a dataset for RPM, or Reads Per Million mapped, support. Only two types of database files support this, Bam files and BigBed files. If the dataset is either one of these, or the name of a database feature which points to one of these files, then it will calculate the total number of mapped alignments (Bam file) or features (BigBed file). It will return this total number. If the dataset does not support RPM (because it is not a Bam or BigBed file, for example), then it will return undefined.

Pass this subroutine one or two values. The first is the name of the dataset. Ideally it should be validated using verify_or_request_feature_types() and have an appropriate prefix (file, http, or ftp). If it does not have a prefix, then it is assumed to be a database feature. The second passed feature is the name of a BioPerl database or an opened database object. This database will be checked for the indicated dataset, and the first returned feature checked for an attribute pointing to a supported file.

For multi-threaded execution pass a third value, the number of CPU cores available to count Bam files. The default is 2 for environments where Parallel::ForkManager is installed, or 1 where it is not.

get_new_feature_list

This subroutine will generate a new feature list collected from the database. Once the list of genomic features is generated, then data may be collected for each item in the list.

The subroutine will generate and return a data hash as described in Bio::ToolBox::file_helper. The data table will have two or three columns. The feature name and type:source are listed in columns one and two, respectively. If the features have an Alias tag, then a third column is included with a comma delimited list of the feature aliases.

The subroutine is passed an array containing the arguments. The keys include

  Required:
  db       => The name of the database or a reference to an 
              established database object. 
  features => A scalar value containing a name representing the 
              type(s) of feature(s) to collect. This name will be 
              parsed into an actual list with the internal subroutine 
              _features_to_classes(). Refer to that documentation for 
              a list of appropriate features.

The subroutine will return a reference to the data hash. It will print status messages to STDOUT.

Example

        my $db_name = 'cerevisiae';
        my %data = get_new_feature_list(
                'db'        => $db_name,
                'features'  => 'genes',
        );
get_new_genome_list

This subroutine will generate a new list of genomic windows. The genome is split into intervals of a specified size that is moved along the genome in specified step sizes.

The subroutine will generate and return a data hash as described in Bio::ToolBox::file_helper. The data table will have 3 columns, including Chromosome, Start, and Stop.

The subroutine is passed an array containing the arguments. The keys include

  Required:
  db       => The name of the database or a reference to an 
              established database object. 
  Optional: 
  win      => A scalar value containing an integer representing the
              size of the window in basepairs. The default value 
              is defined in biotoolbox.cfg file.
  step     => A scalar value containing an integer representing the
              step size for advancing the window across the genome. 
              The default is the window size.

The subroutine will return a reference to the data hash. It will print status messages to STDOUT.

Example

        my $db_name = 'cerevisiae';
        my $window_size = 500;
        my $step_size = 250;
        my %data = get_new_genome_list(
                'db'        => $db_name,
                'win'       => $window_size,
                'step'      => $step_size,
        );
validate_included_feature

This subroutine will validate a database feature to make sure it is useable. It will check feature attributes and compare them against a list of attributes and values to be avoided. The list of unwanted attributes and values is stored in the BioToolBox configuration file biotoolbox.cfg.

Pass the subroutine a Bio::DB::SeqFeature::Store database feature. It will return true (1) if the feature passes validation and false (undefined) if it contains an excluded attribute and value.

get_feature

This subroutine will retrieve a specific feature from a Bio::DB::SeqFeature::Store database for subsequent analysis, manipulation, and/or score retrieval using the get_chromo_region_score() or get_region_dataset_hash() methods. It relies upon unique information to pull out a single, unique feature.

Several attributes may be used to pull out the feature, including the feature's unique database primary ID, name and/or aliases, and GFF type (primary_tag and/or source). The get_new_feature_list() subroutine will generate a list of features with their unique identifiers.

The primary_id attribute is preferentially used as it provides the best performance. However, it is not portable between databases or even re-loading. In that case, the display_name and type are used to identify potential features. Note that the display_name may not be unique in the database. In this case, the addition of aliases may help. If all else fails, a new feature list should be generated.

To get a feature, pass an array of arguments. The keys include

  Required:
  db       => The name of the Bio::DB::SeqFeature::Store database or 
              a reference to an established database object. 
  id       => Provide the primary_id tag. In the 
              Bio::DB::SeqFeature::Store database schema this is a 
              (usually) non-portable, unique identifier specific to a 
              database. It provides the fastest lookup.
  name     => A scalar value representing the feature display_name. 
              Aliases may be appended with semicolon delimiters. 
  type     => Provide the feature type, which is typically expressed 
              as primary_tag:source. Alternatively, provide just the 
              primary_tag only.

While it is possible to identify features with any two attributes (or possibly just name or ID), the best performance is obtained with all three together.

The first SeqFeature object is returned if found.

get_chromo_region_score

This subroutine will retrieve a dataset value for a single specified region in the genome. The region is specified with chromosomal coordinates: chromosome name, start, and stop. It will collect all dataset values within the window, combine them with the specified method, and return the single value.

The subroutine is passed an array containing the arguments. The keys include

  Required:
  db       => The name of the database or a reference to an 
              established database object. Optional if an 
              indexed dataset file (Bam, BigWig, etc.) is provided.
  dataset  => The name of the dataset in the database to be 
              collected. The name should correspond to a feature 
              type in the database, either as type or type:source. 
              The name should be verified using the 
              subroutine verify_or_request_feature_types() prior to passing.
              Multiple datasets may be given, joined by '&', with no
              spaces. Alternatively, specify a data file name. 
              A local file should be prefixed with 'file:', while 
              a remote file should be prefixed with the transfer 
              protocol (ftp: or http:).
  method   => The method used to combine the dataset values found
              in the defined region. Acceptable values include 
              sum, mean, median, range, stddev, min, max, rpm, 
              rpkm, and scores. See _get_segment_score() 
              documentation for more info.
  chromo   => The name of the chromosome (reference sequence)
  start    => The start position of the region on the chromosome
  stop     => The stop position of the region on the chromosome
  end      => Alias for stop
  
  Optional:
  strand   => The strand of the region (-1, 0, or 1) on the 
              chromosome. The default is 0, or unstranded.
  value    => Specify the type of value to collect. Acceptable 
              values include score, count, or length. The default 
              value type is score. 
  log      => Boolean value (1 or 0) indicating whether the dataset 
              values are in log2 space or not. If undefined, the 
              dataset name will be checked for the presence of the 
              phrase 'log2' and set accordingly. This argument is
              critical for accurately mathematically combining 
              dataset values in the region.
  stranded => Indicate whether the dataset values from a specific 
              strand relative to the feature should be collected. 
              Acceptable values include sense, antisense, or all.
              Default is 'all'.
  rpm_sum  => When collecting rpm or rpkm values, the total number  
              of alignments may be provided here. Especially 
              useful when collecting via parallel forked processes, 
              otherwise each fork will sum the number of alignments 
              independently, an expensive proposition. 
                  

The subroutine will return the region score if successful.

Examples

        my $db = open_db_connection('cerevisiae');
        my $score = get_chromo_region_score(
                'db'      => $db,
                'method'  => 'mean',
                'dataset' => $dataset,
                'chr'     => $chromo,
                'start'   => $startposition,
                'stop'    => $stopposition,
                'log'     => 1,
        );
        
get_region_dataset_hash

This subroutine will retrieve dataset values or feature attributes from features located within a defined region and return them as a hash. The (start) positions will be the hash keys, and the corresponding dataset values or attributes will be the hash values. The region is defined based on a genomic feature in the database. The region corresponding to the entire feature is selected by default. Alternatively, it may be adjusted by passing appropriate arguments.

Different dataset values may be collected. The default is to collect score values of the dataset features found within the region (e.g. microarray values). Alternatively, a count of found dataset features may be returned, or the lengths of the found dataset features. When lengths are used, the midpoint position of the feature is used in the returned hash rather than the start position.

The returned hash is keyed by relative coordinates and their scores. For example, requesting a region from -200 to +200 of a feature (using the start and stop options, below) will return a hash whose keys are relative to the feature start position, i.e. the keys will >= -200 and <= 200. Absolute coordinates relative to the reference sequence or chromosome may be optionally returned instead.

The subroutine is passed an array containing the arguments. The keys include

  Required:
  db       => The name of the annotation database or a reference to 
              an established database object. 
  dataset  => The name of the dataset in the database to be 
              collected. The name should correspond to a feature 
              type in the database, either as type or type:source. 
              The name should be verified using the 
              subroutine verify_or_request_feature_types() prior to passing.
              Multiple datasets may be given, joined by '&', with no
              spaces. Alternatively, specify a data file name. 
              A local file should be prefixed with 'file:', while 
              a remote file should be prefixed with the transfer 
              protocol (ftp: or http:).
  Required for database features:
  id       => The Primary ID of the genomic feature. This is 
              database specific and may be used alone to identify 
              a genomic feature, or in conjunction with name and type.
  name     => The name of the genomic feature. Required if the Primary 
              ID is not provided. 
  type     => The type of the genomic feature. Required if the Primary 
              ID is not provided. 
  Required for coordinate positions:
  chromo   => The chromosome or sequence name (seq_id). This may be 
              used instead of name and type to specify a genomic 
              segment. This must be used with start and stop options, 
              and optionally strand options.
  Optional:
  start    => Indicate an integer value representing the start  
              position of the region relative to the feature start.
              Use a negative value to begin upstream of the feature.
              Must be combined with "stop".
  stop|end => Indicate an integer value representing the stop  
              position of the region relative to the feature start.
              Use a negative value to begin upstream of the feature.
              Must be combined with "start".
  ddb      => The name of the data-specific database or a reference 
              to an established database. Use when the data and 
              annotation are in separate databases.
  extend   => Indicate an integer value representing the number of 
              bp the feature's region should be extended on both
              sides.
  position => Indicate the relative position of the feature from 
              which the "start" and "stop" positions are calculated.
              Three values are accepted: "5", which denotes the 
              5' end of the feature, "3" which denotes the 
              3' end of the feature, or "4" which denotes the 
              middle of the feature. This option is only used in 
              conjunction with "start" and "stop" options. The 
              default value is "5".
  strand   => For those features or regions that are NOT 
              inherently stranded (strand 0), artificially set the 
              strand. Three values are accepted: -1, 0, 1. This 
              will overwrite any pre-existing value (it will not, 
              however, migrate back to the database).
  stranded => Indicate whether the dataset values from a specific 
              strand relative to the feature should be collected. 
              Acceptable values include sense, antisense, or all.
              Default is all.
  value    => Indicate which attribute will be returned. Acceptable 
              values include "score", "count", or "length". The  
              default behavior will be to return the score values.
  avoid    => Provide an array reference of database feature types 
              that should be avoided. Any positioned scores which 
              overlap the other feature(s) are not returned. The 
              default is to return all values. A boolean value of 
              1 can also be passed, in which case the same type 
              of feature as the search feature will be used. This
              was the original implementation (v1.35 and below). 
  absolute => Boolean value to indicate that absolute coordinates 
              should be returned, instead of transforming to 
              relative coordinates, which is the default.
                  

The subroutine will return the hash if successful.

Example

        my $db = open_db_connection('cerevisiae');
        my %region_scores = get_region_dataset_hash(
                'db'      => $db,
                'dataset' => $dataset,
                'name'    => $name,
                'type'    => $type,
        );
        
get_chromosome_list

This subroutine will collect a list of chromosomes or reference sequences in a Bio::DB database and return the list along with their sizes in bp. Many BioPerl-based databases are supported, including Bio::DB::SeqFeature::Store, Bio::DB::Fasta, Bio::DB::Sam, Bio::DB::BigWig, Bio::DB::BigWigSet, and Bio::DB::BigBed, or any others that support the "seq_ids" method. See the open_db_connection subroutine for more information.

Pass the subroutine either the name of the database, the path to the database file, or an opened database object.

Optionally pass a second value, a boolean argument to limit and exclude unwanted chromosomes as defined by the "chromosome_exclude" option in the BioToolBox configuration file, biotoolbox.cfg. A true value limits chromosomes, and false includes all chromosomes. The default is to return all chromosomes. Sometimes some sequences are simply not wanted in analysis, like the mitochondrial chromosome or unmapped contigs.

The subroutine will return an array, with each element representing each chromosome or reference sequence in the database. Each element is an anonymous array of two elements, the chromosome name and length in bp.

Example

        my $db = open_db_connection('cerevisiae');
        # get all chromosomes in the database
        my @chromosomes = get_chromosome_list($db);
        foreach (@chromosomes) {
                my $name = $_->[0];
                my $length = $_->[1];
                print "chromosome $name is $length bp\n";
        }

INTERNAL SUBROUTINES

These are not intended for normal consumption but are documented here so that we know what is going on.

_features_to_classes

This internal subroutine provides a conveniant look up and conversion of a single-word description of a category of features into a list of actual feature classes in the database. For example, the word 'gene' may include all ORFs, snRNAs, snoRNAs, and ncRNAs.

Pass the subroutine the feature category name as a scalar value. The actual list of feature types will be collected and returned as an array. Multiple values may be passed as a comma-delimited string (no spaces).

The aliases and feature lists are specified in the Bio::ToolBox::db_helper configuration file, biotoolbox.cfg. Additional lists and aliases may be placed there. The lists are database specific, or they can be added to the default database.

If the passed category name does not match an alias in the config file, then it is assumed to be a feature in the database. No further validation will be done (if it's not valid, simply no features would be returned from the database).

Also, feature types may be passed as the GFF's method:source, in which case they are assumed to be valid and not checked.

_get_segment_score

This internal subroutine is used to collect the dataset scores for an established genomic region. It works with a variety of data sources.

First, the data may be stored directly in the Bio::DB::SeqFeature::Store (using the original GFF score value). Second, the feature may reference a data file as an attribute (e.g., wigfile=/path/to/file.wib). Finally, the name(s) of a data file may be passed from which to collect the data. Supported data files include BigWig (.bw), BigBed (.bb), USeq (.useq), and Bam (.bam). A Bio::DB::BigWigSet database is also supported.

The subroutine is passed an array of ten specific values, all of which must be defined and presented in this order. These values include

  [0] The opened database object that may be used to generate the 
      the segment and contain the data to collect. If the dataset 
      request is from a big file (bigWig, bigBed, Bam), then this 
      database will likely not be used. Otherwise a 
      Bio::DB::SeqFeature::Store or Bio::DB::BigWigSet database 
      object should be passed.
  [1] The chromosome or seq_id of the segment to examine
  [2] The start position of the segment
  [3] The stop or end position of the segment
  [4] The strand of the region (or original feature) -1, 0, or 1.
  [5] The dataset name for filename. Multiple datasets may be included, 
      delimited with an ampersand (&). Multiple datasets are merged into 
      one, unless excluded by strand. Local data source files should be 
      prepended with 'file:', while remote data source files should be 
      prepended with the transfer protocol (http: or ftp:).
  [6] The data type to be collecting. In most cases, the score value 
      is used, but others may be collected. Accepted values include
         
         score
         count
         length
         
  [7] The method of combining all of the dataset values found in the 
      segment into a single value. Accepted values include
         
         sum
         mean
         median
         min
         max
         range (returns difference between max and min)
         stddev (returns the standard deviation of a population)
         indexed (returns hash of postion => score)
         rpm (returns reads per million mapped, only valid with 
              bam and bigbed databases)
         rpkm (same as rpm but normalized for length in kb)
         scores (returns an array reference of all the raw scores)
         
  [8] The strandedness of acceptable data. Genomic segments 
      established from an inherently stranded database feature 
      (e.g. ORF) have a specific strandedness. If the dataset strand 
      does not match the specified criteria for strandedness, then it 
      is ignored. If the dataset does not have a specified strand, 
      then it is used regardless of the specified criteria for 
      strandedness. Currently, only transcription data is stranded. 
      Accepted values include
         
         sense       take only values on the same strand
         antisense   take only values on the opposite strand
         all         take all values
         
  [9] The log status of the dataset. Many microarray datasets are in 
      log2 format, and must be de-logged to perform accurate 
      mathematical calculations, such as mean or median. Supply a 
      boolean (0 or 1) value.
      
  

The subroutine returns a single numeric value if appropriate dataset values are identified in the genomic segment. If not, then a non-value (.) is returned. Use of a period as a non-value avoids un-defined errors in some subsequent programmatic manipulations and provides a convenient human visual marker.

_return_null

Internal method for returning a 0 or internal null '.' character based on the method being used.

AUTHOR

 Timothy J. Parnell, PhD
 Howard Hughes Medical Institute
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.