Description
Arguably, the backbone of GenOO is the Region
role that corresponds to an area on a reference sequence. It requires other classes that consume it, to implement specific attributes such as the strand
, rname
(reference name), start
, stop
and copy_number
. This role is consumed by several other classes within the framework and provides common grounds for code integration. Extending this approach, the GenomicRegion
class consumes Region
and additionally sets the constraint that the reference sequence has to be a particular chromosome. The GenomicRegion
serves as the base for the representation of advanced genomic elements such as the genes, gene transcripts, 5'UTRs, non-coding RNAs and others.
Details
Gene
A Gene
, in essence, is defined as a set of Transcript
objects which also must share some positional overlap.
According to recent annotations and contrary to common conception, genes cannot be divided into protein coding and non-coding ones. Instead, and possibly more correctly even in biological terms, a user can ask if a gene has coding potential or not. In this case a gene will scan through its assigned transcripts and check if there are any coding ones or not.
Given the above gene definition, it is perhaps surprising that a gene extends the GenomicRegion
class. However, this should not be the case as a gene object can extract positional information from its assigned transcripts. For example, the start position of a gene is defined as the smaller start position of its transcripts. Similarly, its strand is defined as the strand of its transcripts which by the way must be the same for all its transcripts.
Transcript / Isoform
The Transcript
class corresponds to a gene transcript/isoform and can be an independent object or more commonly belong to a Gene
object.
Contrary to a gene object, a transcript object does not internally look upstream to its assigned gene to extract infromation. This is done on purpose to avoid strange cyclic assignments and also because we believe that the transcript annotation should serve as the base for the gene annotation and not vice versa. Therefore, information extraction from the gene level, although possible, is left entirelly on the user.
Transcripts contrary to genes are divided into protein coding and non-coding ones. Note that protein coding transcripts in contrast to non-coding ones have methods that extract the coding (CDS
), 5’ UTR (UTR5
) and 3’UTR (UTR3
) sequences and coordinates.
A particularly important (as people that have worked with alternative splicing can verify) structure within the genomic group of classes is the Spliceable
role. This role groups the functionality for entities/classes that undergo alternative splicing and supports several advanced methods such as the extraction of exonic and intronic elements and facilitates management of the complex structures. Importantly, Spliceable
is primarily consumed by Transcript
but it is also consumed by UTR5
, UTR3
and CDS
. This has a very interesting and in several cases very useful side-effect that for example, one can ask for the introns that are extracted from the 3'UTR sequence of a transcript ($transcript->utr3->introns
)
Examples
Creating a transcript
my
$transcript
= GenOO::Transcript->new(
id
=>
'transcr_1'
,
strand
=> 1,
chromosome
=>
'chrY'
,
start
=> 100,
stop
=> 410,
splice_starts
=> [100, 200, 300],
splice_stops
=> [150, 260, 410],
coding_start
=> 220,
coding_stop
=> 370,
biotype
=>
'coding'
,
);
Creating a gene
my
$gene
= GenOO::Gene->new(
name
=>
'Gene_A'
,
transcripts
=> [
$transcript_1
,
$transcript_2
]
# These are objects, not transcript ids
);
Collection of transcripts
# Create a collection of transcripts from a GTF file
my
$transcript_collection
= GenOO::TranscriptCollection::Factory->create(
'GTF'
, {
file
=>
'transcripts_file.gtf'
})->read_collection;
Collection of genes
# A collection of genes can be created from a transcript collection and from a hash that
# assigns transcript ids to gene names
my
$transcript_id_to_genename
= {
'transcr_1'
=>
'Gene_A'
,
'transcr_2'
=>
'Gene_A'
,
'transcr_3'
=>
'Gene_B'
,
# ...
}
my
$gene_collection
= GenOO::GeneCollection::Factory->create(
'FromTranscriptCollection'
, {
transcript_collection
=>
$transcript_collection
,
annotation_hash
=>
$transcript_id_to_genename
})->read_collection;