NAME

CracTools::Annotator - Generic annotation based on CracTools::GFF::Query::File

VERSION

version 1.251

SYNOPSIS

  # Construct the annotator object that will index the GFF file in
  # a genomic interval-tree based structure
  my $annotator = CracTools::Annotator->new("annotation.gff");

  # Query the annotator object for overlapping annotations
  my ($annot) = $annotator->getBestAnnotationCandidate("chr1",12345,12380);

  if(defined $annot->{exon}) {
    print STDERR "Found overlapping exon\n";
  } else {
    # If no overlapping exons have been found, we check for the closest gene
    # in the downstream direction
    my $closest_annot = $annotator->getAnnotationNearestDownCandidates("chr1",12345)->[0];
    if(defined $closest_annot && defined $closest_annot->{gene}) {
      print STDERR "Closest gene annotation is ".(12345 - $closest_annot->{gene}->end)." bp away\n";
    }
  }

DESCRIPTION

This module is based on CracTools::Interval::Query::File and provides powerful methods to query annotation files and prioritize hits to fit specific application needs.

The annotator works with a 0-based coordinate system and closed [a,b] intervals.
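
As a minimal sketch of what closed intervals imply for overlap tests: both ends are part of the interval, so two intervals that share a single position overlap. The helper names below are illustrative and not part of the module. The shift from GFF's 1-based coordinates is shown only to illustrate the convention; this document does not say where, or whether, the module performs it.

```perl
use strict;
use warnings;

# Two closed intervals [a1,b1] and [a2,b2] overlap when each one
# starts no later than the other one ends.
sub closed_intervals_overlap {
    my ($a1, $b1, $a2, $b2) = @_;
    return ($a1 <= $b2 && $a2 <= $b1) ? 1 : 0;
}

# GFF files use 1-based closed coordinates; shifting both ends by one
# gives the equivalent 0-based closed interval.
sub gff_to_zero_based {
    my ($start, $end) = @_;
    return ($start - 1, $end - 1);
}

my ($start, $end) = gff_to_zero_based(101, 200);    # 0-based [100,199]

# Sharing position 199 counts as an overlap; starting at 200 does not.
print closed_intervals_overlap($start, $end, 199, 250), "\n";
print closed_intervals_overlap($start, $end, 200, 250), "\n";
```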

The principle behind CracTools::Annotator is to build a genomic interval tree that holds the annotations. The user can then query this data structure to retrieve annotations. To organize the retrieved annotations, we build candidate hashes, each of which is a branch of the annotation tree. For a classic GFF annotation file, if the queried interval overlaps an exon, the branch goes from the exon leaf up to the gene root, passing through an mRNA internal node.

Candidate structure

An annotation candidate is a hash data structure, where keys are GFF features (exon, gene, mRNA) and values are CracTools::GFF::Annotation objects (each one a parsed GFF line).

It also contains an entry parent_feature that holds the parenting links between features, and an entry leaf_feature that holds the feature name of the leaf ("exon" for example).

  my $candidate = {
    "exon" => CracTools::GFF::Annotation, 
    "gene" => CracTools::GFF::Annotation,
    "feature" => CracTools::GFF::Annotation, ..., 
    parent_feature => {exon => mRNA, featureA => featureB, ...},
    leaf_feature => "exon",
  };
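
The structure above can be walked from the leaf to the root by following the parent_feature links. The following is a small sketch under stated assumptions: plain strings stand in for the CracTools::GFF::Annotation objects so that it runs without the module, and candidate_path is a hypothetical helper, not a method of the module.

```perl
use strict;
use warnings;

# Stand-in candidate: plain strings replace the CracTools::GFF::Annotation
# objects so that the sketch runs without the module installed.
my $candidate = {
    exon           => "exon annotation",
    mRNA           => "mRNA annotation",
    gene           => "gene annotation",
    parent_feature => { exon => "mRNA", mRNA => "gene" },
    leaf_feature   => "exon",
};

# Follow parent_feature links from the leaf until a feature has no parent.
# %seen guards against a malformed candidate whose links form a cycle.
sub candidate_path {
    my ($cand) = @_;
    my (@path, %seen);
    my $feature = $cand->{leaf_feature};
    while (defined $feature && !$seen{$feature}++) {
        push @path, $feature;
        $feature = $cand->{parent_feature}{$feature};
    }
    return @path;
}

print join(" -> ", candidate_path($candidate)), "\n";
```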

Priority methods

Each annotation query can be parameterized with prioritization methods that choose the set of "best" annotation(s) returned to the user. This module provides default prioritization methods, but you can write your own to fit your application's needs.

There are two kinds of prioritization methods: prioritySub and comparSub.

Priority subroutine

The priority subroutine (by default "getCandidatePriorityDefault") receives as input the queried interval (start and end positions) and an annotation candidate. It must return a priority level (the lower, the more important) and a string that is a literal version of the priority level.
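
A custom priority subroutine following this contract might look like the sketch below. It assumes the annotation objects expose start and end accessors (only end appears in the SYNOPSIS), and it uses a hypothetical stand-in class, My::FakeAnnotation, so that the example runs on its own. The ranking rule and the type strings are arbitrary choices for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for CracTools::GFF::Annotation, exposing only the
# start/end accessors that this sketch relies on.
package My::FakeAnnotation;
sub new   { my ($class, %args) = @_; return bless {%args}, $class; }
sub start { $_[0]{start} }
sub end   { $_[0]{end} }

package main;

# Custom priority: 0 when the query lies entirely inside the candidate's
# exon, 1 when the candidate has a gene but no containing exon,
# -1 (candidate to avoid) otherwise.
sub my_priority {
    my ($pos_start, $pos_end, $candidate) = @_;
    my $exon = $candidate->{exon};
    if (defined $exon && $exon->start <= $pos_start && $pos_end <= $exon->end) {
        return (0, "EXON");
    }
    return (1, "GENE") if defined $candidate->{gene};
    return (-1, "NONE");
}

my $candidate = {
    exon => My::FakeAnnotation->new(start => 100, end => 200),
    gene => My::FakeAnnotation->new(start => 50,  end => 900),
};
my ($priority, $type) = my_priority(120, 150, $candidate);
print "$priority $type\n";
```

Such a subroutine would be passed by reference (\&my_priority) as the optional priority argument of the query methods.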

Compare subroutine

The compare subroutine (by default "compareTwoCandidatesDefault") receives as input two annotation candidates and the queried interval. It must return the better of the two candidates, or undef if it cannot decide between them.
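
As an illustration of this contract, the sketch below prefers the candidate with the shorter gene and returns undef when the two cannot be ranked. The rule itself is an arbitrary example, start and end accessors are assumed on annotation objects, and My::FakeAnnotation is a hypothetical stand-in used only to keep the example self-contained.

```perl
use strict;
use warnings;

# Hypothetical stand-in for CracTools::GFF::Annotation (start/end only).
package My::FakeAnnotation;
sub new   { my ($class, %args) = @_; return bless {%args}, $class; }
sub start { $_[0]{start} }
sub end   { $_[0]{end} }

package main;

# Length of a candidate's gene as a closed interval, or undef without a gene.
sub gene_length {
    my ($candidate) = @_;
    my $gene = $candidate->{gene};
    return undef unless defined $gene;
    return $gene->end - $gene->start + 1;
}

# Custom compare: prefer the candidate with the shorter gene and return
# undef when the two cannot be ranked. The queried interval is accepted
# to match the documented argument order but is not used here.
sub my_compare {
    my ($candidate1, $candidate2, $pos_start, $pos_end) = @_;
    my ($len1, $len2) = (gene_length($candidate1), gene_length($candidate2));
    return undef unless defined $len1 && defined $len2;
    return undef if $len1 == $len2;
    return $len1 < $len2 ? $candidate1 : $candidate2;
}

my $short = { gene => My::FakeAnnotation->new(start => 100, end => 199) };
my $long  = { gene => My::FakeAnnotation->new(start => 0,   end => 999) };
my $best  = my_compare($long, $short, 120, 150);
print "shorter gene wins\n" if defined $best && $best == $short;
```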

METHODS

new

  Arg [1] : String - $gff_file
            GFF file used to perform annotation
  Arg [2] : String - $mode
            Execution mode : "fast" or "light" ("light" by default)

  Example     : my $annotator = CracTools::Annotator->new($gff_file);
  Description : Create a new CracTools::Annotator object based on the
                provided GFF file. If "light" mode is specified, CracTools::Annotator
                will consume less memory but will have an execution-time overhead.
  ReturnType  : CracTools::Annotator

mode

  Description : Return the mode used to create the annotator
  ReturnType  : string ("light" or "fast")

foundAnnotation

  Arg [1] : String - chr
  Arg [2] : String - pos_start
  Arg [3] : String - pos_end
  Arg [4] : String - strand

  Description : Return true if any overlapping annotation has been found
  ReturnType  : Boolean

foundGene

  Arg [1] : String - chr
  Arg [2] : String - pos_start
  Arg [3] : String - pos_end
  Arg [4] : String - strand

  Description : Return true if an overlapping gene annotation has been found
  ReturnType  : Boolean

foundSameGene

  Arg [1] : String - chr
  Arg [2] : String - pos_start1
  Arg [3] : String - pos_end1
  Arg [4] : String - pos_start2
  Arg [5] : String - pos_end2
  Arg [6] : String - strand

  Description : Return true if the same gene overlaps both intervals.
  ReturnType  : Boolean

getBestAnnotationCandidate

  Arg [1] : String - chr
  Arg [2] : String - pos_start
  Arg [3] : String - pos_end
  Arg [4] : String - strand
  Arg [5] : (Optional) Subroutine - see "getCandidatePriorityDefault" for more details
  Arg [6] : (Optional) Subroutine - see "compareTwoCandidatesDefault" for more details

  Description : Return best annotation candidate according to the priorities given
                by the subroutine(s) in argument.
  ReturnType  : AnnotationCandidate, Int(priority), String(type)

getBestAnnotationCandidates

  Arg [1] : String - chr
  Arg [2] : String - pos_start
  Arg [3] : String - pos_end
  Arg [4] : String - strand
  Arg [5] : (Optional) Subroutine - see "getCandidatePriorityDefault" for more details
  Arg [6] : (Optional) Subroutine - see "compareTwoCandidatesDefault" for more details

  Description : Return best annotation candidates according to the priorities given
                by the subroutine(s) in argument.
  ReturnType  : ArrayRef of AnnotationCandidates, Int(priority), String(type)
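
The module's own selection logic is not described in this document. The sketch below shows one way a priority subroutine and a compare subroutine could be combined to produce the documented return shape (candidates, priority, type). The select_best helper and the toy candidates with "rank" and "score" keys are hypothetical. As a simplification, ties are compared against the first retained candidate only.

```perl
use strict;
use warnings;

# Keep the candidates that share the lowest non-negative priority, then
# let the compare subroutine settle ties where it can.
sub select_best {
    my ($candidates, $pos_start, $pos_end, $priority_sub, $compare_sub) = @_;
    my (@best, $best_priority, $best_type);
    for my $candidate (@$candidates) {
        my ($priority, $type) = $priority_sub->($pos_start, $pos_end, $candidate);
        next if $priority < 0;    # -1 marks a candidate to avoid
        if (!defined $best_priority || $priority < $best_priority) {
            @best = ($candidate);
            ($best_priority, $best_type) = ($priority, $type);
        }
        elsif ($priority == $best_priority) {
            my $winner = $compare_sub->($best[0], $candidate, $pos_start, $pos_end);
            if    (!defined $winner)      { push @best, $candidate; }   # undecided: keep both
            elsif ($winner == $candidate) { @best = ($candidate); }     # new candidate wins
            # otherwise the current best is kept
        }
    }
    return (\@best, $best_priority, $best_type);
}

# Toy candidates: "rank" drives the priority, "score" drives the tie-break.
my @candidates = (
    { name => "A", rank => 1,  score => 5 },
    { name => "B", rank => 0,  score => 2 },
    { name => "C", rank => 0,  score => 7 },
    { name => "D", rank => -1, score => 9 },
);
my $priority_sub = sub {
    my (undef, undef, $c) = @_;
    return ($c->{rank}, "rank$c->{rank}");
};
my $compare_sub = sub {
    my ($c1, $c2) = @_;
    return undef if $c1->{score} == $c2->{score};
    return $c1->{score} > $c2->{score} ? $c1 : $c2;
};

my ($best, $priority, $type) =
    select_best(\@candidates, 10, 20, $priority_sub, $compare_sub);
print join(",", map { $_->{name} } @$best), " $priority $type\n";
```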

getAnnotationCandidates

  Arg [1] : String - chr
  Arg [2] : String - pos_start
  Arg [3] : String - pos_end
  Arg [4] : String - strand

  Description : Return an array with all annotation candidates overlapping the
                chromosomal region.
  ReturnType  : ArrayRef of AnnotationCandidate

getAnnotationNearestDownCandidates

  Arg [1] : String - chr
  Arg [2] : String - pos_start
  Arg [3] : String - strand

  Description : Return an array with all annotation candidates nearest downstream
                of the query region (without overlap).
  ReturnType  : ArrayRef of AnnotationCandidate

getAnnotationNearestUpCandidates

  Arg [1] : String - chr
  Arg [2] : String - pos_end
  Arg [3] : String - strand

  Description : Return an array with all annotation candidates nearest upstream
                of the query region (without overlap).
  ReturnType  : ArrayRef of AnnotationCandidate

getCandidatePriorityDefault

  Arg [1] : String - pos_start
  Arg [2] : String - pos_end
  Arg [3] : hash - candidate

  Description : Default method used to give a priority to a candidate.
                You can create your own priority method to fit your specific need
                for selecting the best annotation.
                The best priority is 0. A priority of -1 means that this candidate
                should be avoided.
  ReturnType  : Array($priority,$type) where $priority is an integer and $type a string

compareTwoCandidatesDefault

  Arg [1] : hash - candidate1
  Arg [2] : hash - candidate2
  Arg [3] : pos_start (position start that has been queried)
  Arg [4] : pos_end (position end that has been queried)

  Description : Default method used to choose the best candidate when priorities are equal.
                You can create your own compare method to fit your specific need
                for selecting the best candidate.
  ReturnType  : AnnotationCandidate - best candidate or undef if we cannot decide which candidate is the best

PRIVATE METHODS

_init

  Description : init method, load GFF annotation into a
                CracTools::GFF::Query object.

_constructCandidates

  Arg [1] : String - annot_id
  Arg [2] : Hash ref - candidate
            Since this method is recursive, this is the object that
            we are constructing
  Arg [3] : Hash ref - annot_hash
            annot_hash is a hash reference where keys are annotation IDs
            and values are CracTools::GFF::Annotation objects.

  Description : _constructCandidate is a recursive method that builds a
                candidate hash. A candidate is defined as a path into the annotation
                (multi-rooted) tree from a leaf (ex: an exon) to a root (ex: a gene).
  ReturnType  : Candidate Hash ref where keys are GFF features and
                values are CracTools::GFF::Annotation objects :
                { "exon" => CracTools::GFF::Annotation, 
                  "gene" => CracTools::GFF::Annotation,
                  feature => CracTools::GFF::Annotation, ..., 
                  parent_feature => {featureA => featureB},
                  leaf_feature => "exon",
                }
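
To make the leaf-to-root construction concrete, here is a rough sketch of a recursive builder over a toy annotation table. It is not the module's implementation: build_candidates is a hypothetical name, plain hashes stand in for CracTools::GFF::Annotation objects, the "parents" key mimics the GFF3 Parent attribute, and candidate values are annotation IDs rather than objects so that the result is easy to print.

```perl
use strict;
use warnings;

# Toy annotation table keyed by ID. An exon with two parent mRNAs yields
# two leaf-to-root paths, hence two candidates.
my %annot_hash = (
    gene1 => { feature => "gene", parents => [] },
    mRNA1 => { feature => "mRNA", parents => ["gene1"] },
    mRNA2 => { feature => "mRNA", parents => ["gene1"] },
    exon1 => { feature => "exon", parents => ["mRNA1", "mRNA2"] },
);

# Recursively extend a candidate from annotation $annot_id up to each
# reachable root, returning one candidate per leaf-to-root path.
sub build_candidates {
    my ($annot_id, $candidate, $annots) = @_;
    my $annot = $annots->{$annot_id} or return ();
    my $feature = $annot->{feature};

    # Copy the candidate so that sibling paths do not share state.
    my %next = (%$candidate, $feature => $annot_id);
    $next{parent_feature} = { %{ $candidate->{parent_feature} || {} } };
    $next{leaf_feature} //= $feature;    # first feature visited is the leaf

    my @parents = @{ $annot->{parents} };
    return (\%next) unless @parents;     # reached a root

    my @candidates;
    for my $parent_id (@parents) {
        next unless exists $annots->{$parent_id};
        my %links  = (%{ $next{parent_feature} }, $feature => $annots->{$parent_id}{feature});
        my %branch = (%next, parent_feature => \%links);
        push @candidates, build_candidates($parent_id, \%branch, $annots);
    }
    return @candidates;
}

my @candidates = build_candidates("exon1", {}, \%annot_hash);
printf "%s: %s -> %s -> %s\n", $_->{leaf_feature}, $_->{exon}, $_->{mRNA}, $_->{gene}
    for @candidates;
```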

_constructCandidatesFromAnnotation

  Arg [1] : Hash ref - annotations
            annotations is a hash reference where keys are coordinates
            given by CracTools::Interval::Query::File objects.
  Description : Build all candidates for the given annotations by calling
                _constructCandidate, a recursive method that builds a candidate hash.
  ReturnType  : Candidate array ref of all candidates built by _constructCandidate

AUTHORS

  • Nicolas PHILIPPE <nphilippe.research@gmail.com>

  • Jérôme AUDOUX <jaudoux@cpan.org>

  • Sacha BEAUMEUNIER <sacha.beaumeunier@gmail.com>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2017 by IRMB/INSERM (Institute for Regenerative Medicine and Biotherapy / Institut National de la Santé et de la Recherche Médicale) and AxLR/SATT (Languedoc Roussillon / Societe d'Acceleration de Transfert de Technologie).

This is free software, licensed under:

  The GNU Affero General Public License, Version 3, November 2007