NAME

Bio::ViennaNGS::SpliceJunc - Perl extension for alternative splicing analysis

SYNOPSIS

  use Bio::ViennaNGS::SpliceJunc;
  use Bio::ViennaNGS::Fasta;

  # get a Bio::ViennaNGS::Fasta object
  my $fastaO = Bio::ViennaNGS::Fasta->new($fasta_in);

  # Extract annotated splice sites from BED12
  bed6_ss_from_bed12($bed12_in,$dest_annot,$window,$want_canonical,$fastaO);

  # Extract mapped splice junctions from RNA-seq data
  bed6_ss_from_rnaseq($bed_in,$dest_ss,$window,$mincov,$want_canonical,$fastaO);

  # Check for each splice junction seen in RNA-seq if it overlaps with
  # any annotated splice junction
  @res = intersect_sj($dest_annot,$dest_ss,$dest,$prefix,$window,$mil);

  # Convert splice junctions seen in RNA-seq data to BED12
  @res = bed6_ss_to_bed12($s_in,$outdir,$window,$mincov,$want_circular);

  # Check whether a splice junction is canonical
  $c = ss_isCanonical($chr,$pos5,$pos3,$fastaO);

DESCRIPTION

Bio::ViennaNGS::SpliceJunc is a Perl module for alternative splicing (AS) analysis. It provides routines for identification, characterization and visualization of novel and existing (annotated) splice junctions from RNA-seq data.

Identification of novel splice junctions is based on intersecting potentially novel splice junctions from RNA-seq data with annotated splice junctions.

SUBROUTINES

bed6_ss_from_bed12($bed12,$dest,$window,$can,$fastaO)

Extracts splice junctions from a BED12 file (provided via argument $bed12) and writes one BED6 file per transcript to $dest, containing all splice junctions of that transcript. If $can is 1, canonical splice junctions are reported in the 'name' field of the output BED6 file. Output splice junctions can be flanked by a window of +/- $window nt. $fastaO is a Bio::ViennaNGS::Fasta object. Each splice junction is represented as two BED lines in the output BED6 file.
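To illustrate what extracting splice junctions from BED12 involves, the following minimal sketch derives intron intervals from the blockSizes and blockStarts columns of a single BED12 line. The helper name introns_from_bed12 is hypothetical and not part of this module, and the sketch ignores windowing, canonical-motif annotation and the two-lines-per-junction output layout described above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::ViennaNGS::SpliceJunc):
# derive intron intervals from one tab-separated BED12 line.
sub introns_from_bed12 {
    my ($line) = @_;
    chomp $line;
    my @f = split /\t/, $line;
    my ($chr, $start, $strand) = @f[0, 1, 5];

    # blockSizes and blockStarts are comma-separated lists;
    # a trailing comma is permitted by the BED format.
    my @sizes  = grep { length } split /,/, $f[10];
    my @starts = grep { length } split /,/, $f[11];

    my @introns;
    for my $i (0 .. $#sizes - 1) {
        my $intron_start = $start + $starts[$i] + $sizes[$i];  # end of exon i
        my $intron_end   = $start + $starts[$i + 1];           # start of exon i+1
        push @introns, [$chr, $intron_start, $intron_end, $strand];
    }
    return @introns;
}

# Made-up transcript with three exons: 1000-1100, 1400-1600, 1700-2000
my $bed12 = join "\t", 'chr1', 1000, 2000, 'tx1', 0, '+', 1000, 2000, 0, 3,
                       '100,200,300,', '0,400,700,';
printf "%s\t%d\t%d\t%s\n", @$_ for introns_from_bed12($bed12);
```

For the made-up transcript above, the two introns span 1100-1400 and 1600-1700 in zero-based, half-open BED coordinates.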

bed6_ss_from_rnaseq($bed_in,$dest,$window,$mcov,$can,$fastaO)

Extracts splice junctions from mapped RNA-seq data. The input BED6 file should contain coordinates of introns in the following syntax:

chr1 3913 3996 splits:97:97:97:N:P 0 +

The fourth column in this BED file (corresponding to the 'name' field according to the BED specification) should be a colon-separated string of six elements, where the first element should be 'splits' and the second element is assumed to hold the number of reads supporting this splice junction. The fifth element indicates the splice junction type: a capital 'N' denotes a normal splice junction, whereas 'C' indicates a circular and 'T' a trans-splice junction. Only normal splice junctions ('N') are considered; all others are skipped. Elements 3, 4 and 6 are not processed further.

We recommend using segemehl (http://www.bioinf.uni-leipzig.de/Software/segemehl/) for generating this type of BED6 file. This routine is, however, not limited to segemehl output. BED6 files containing splice junction information from other short read mappers or third-party sources will be processed if they are formatted as described above.

This routine writes a BED6 file for each splice junction provided in the input to $dest. Output splice junctions can be flanked by a window of +/- $window nt. Canonical splice junctions are reported in the 'name' field of the output BED6 file if $can is 1 and $fastaO is a Bio::ViennaNGS::Fasta object. Each splice junction is represented as two BED lines in the output BED6 file. Only splice junctions that are supported by at least $mcov reads are reported.
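The input filtering described above (six colon-separated elements, type 'N' only, minimum read coverage) can be sketched in plain Perl as follows. The helper name parse_split_line and the returned hash layout are assumptions made for illustration; they are not part of this module's interface.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one BED6 intron line in the format shown above.
# Returns a hash reference, or nothing if the line should be skipped.
sub parse_split_line {
    my ($line, $mincov) = @_;
    chomp $line;
    my ($chr, $start, $end, $name, $score, $strand) = split /\s+/, $line;

    my @el = split /:/, $name;
    return unless @el == 6 && $el[0] eq 'splits';

    my ($reads, $type) = @el[1, 4];   # 2nd element: read count, 5th: type
    return unless $type eq 'N';       # skip circular (C) and trans (T) junctions
    return unless $reads >= $mincov;  # coverage filter

    return {
        chr    => $chr,
        start  => $start,
        end    => $end,
        strand => $strand,
        reads  => $reads,
    };
}

my $sj = parse_split_line("chr1 3913 3996 splits:97:97:97:N:P 0 +", 10);
print "$sj->{chr}:$sj->{start}-$sj->{end} supported by $sj->{reads} reads\n" if $sj;
```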

bed6_ss_to_bed12($bed_in,$dest,$window,$mcov,$circ)

Produces BED12 output for splice junctions found in RNA-seq data. Input BED6 files (provided via $bed_in) are expected to conform to the segemehl (http://www.bioinf.uni-leipzig.de/Software/segemehl/) standard format for reporting splice junctions, which has the following syntax:

chr1 3913 3996 splits:97:97:97:N:P 0 +

See bed6_ss_from_rnaseq for details.

$dest is the output path. Output splice junctions can optionally be flanked by a window of +/- $window nt. Only splice junctions that are supported by at least $mcov reads are reported. If $circ is 1, circular splice junctions are reported (if present in the input), else normal splice junctions are processed.

intersect_sj($p_annot,$p_mapped,$dest,$prefix,$window,$mil)

Intersects all splice junctions identified in an RNA-seq experiment with annotated splice junctions. Identifies and characterizes novel and existing splice junctions. Each BED6 file in $p_mapped is intersected only with those transcript splice junction BED6 files in $p_annot whose genomic location spans the query splice junction. This avoids intersecting each splice site found in the mapped RNA-seq data with all annotated transcripts. $mil specifies a maximum intron length.

The intersection operations are performed with bedtools intersect from the BEDtools suite. BED sorting operations are performed with bedtools sort.

Writes two BED6 files to $dest (optionally prefixed by $prefix), which contain novel and existing splice junctions, respectively.
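The split into novel and existing junctions can be pictured with the simplified sketch below. It is not how this routine works internally: it replaces the bedtools-based window intersection with an exact lookup on chromosome, start, end and strand, and the helper name classify_junctions is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: partition mapped junctions into existing and novel
# ones by exact match against a list of annotated junctions.
# Each junction is an array reference [chr, start, end, strand].
sub classify_junctions {
    my ($annotated, $mapped) = @_;

    my %known;
    $known{ join ':', @$_ } = 1 for @$annotated;

    my (@existing, @novel);
    for my $sj (@$mapped) {
        if ($known{ join ':', @$sj }) { push @existing, $sj }
        else                          { push @novel,    $sj }
    }
    return (\@existing, \@novel);
}

# Made-up coordinates for illustration
my @annot  = (['chr1', 1100, 1400, '+'], ['chr1', 1600, 1700, '+']);
my @mapped = (['chr1', 1100, 1400, '+'], ['chr1', 1100, 1700, '+']);

my ($existing, $novel) = classify_junctions(\@annot, \@mapped);
printf "%d existing, %d novel\n", scalar @$existing, scalar @$novel;
```

An exact lookup treats any junction that deviates by even a single nucleotide as novel, which is why a window-based overlap is the more tolerant choice for real data.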

ss_isCanonical($chr,$p5,$p3,$fo)

Checks whether a given splice junction is canonical, i.e. whether the first and last two nucleotides of the enclosed intron correspond to a certain nucleotide motif. $chr is the chromosome name, $p5 and $p3 are the 5' and 3' ends of the splice junction, and $fo is a Bio::ViennaNGS::Fasta object holding the underlying reference genome.

This routine does not explicitly consider strandedness: splice junction motifs are evaluated in terms of the forward strand of the underlying reference sequence. This is best explained by an example: Consider the splice junction motif GU->AG on the reverse strand. Read in 5' to 3' direction on the forward strand, this junction reads CT->AC. A splice junction is canonical if its motif corresponds to one of the following cases:

  5'===]GT|CT....AG|AC[====3' ie GT->AG or CT->AC
  5'===]GC|CT....AG|GC[====3' ie GC->AG or CT->GC
  5'===]AT|GT....AC|AT[====3' ie AT->AC or GT->AT
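The six forward-strand motifs listed above can be checked with a simple lookup, as in the following sketch. It takes the intron sequence directly instead of coordinates and a Bio::ViennaNGS::Fasta object; the helper name is_canonical_intron and its 0/1 return values are assumptions and do not describe the return value of ss_isCanonical.

```perl
use strict;
use warnings;

# Forward-strand motifs from the list above, written as
# "first two nt -> last two nt" of the intron.
my %CANONICAL = map { $_ => 1 } qw(GT->AG CT->AC GC->AG CT->GC AT->AC GT->AT);

# Hypothetical helper: returns 1 if the intron sequence (forward strand)
# starts and ends with one of the motifs above, 0 otherwise.
sub is_canonical_intron {
    my ($intron) = @_;
    return 0 if length($intron) < 4;
    my $motif = uc(substr($intron, 0, 2)) . '->' . uc(substr($intron, -2));
    return $CANONICAL{$motif} ? 1 : 0;
}

# Made-up intron sequences for illustration
for my $seq (qw(GTAAGTTTTCAG CTGAAAACTTAC GTAAGTTTTCCC)) {
    printf "%s\t%s\n", $seq, is_canonical_intron($seq) ? 'canonical' : 'non-canonical';
}
```

Because reverse-strand motifs are stored in their forward-strand reading, no reverse complementing is needed at lookup time.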

DEPENDENCIES

This module depends on the following Perl modules:

Bio::ViennaNGS
Bio::ViennaNGS::Fasta
IPC::Cmd
Path::Class
Carp

Bio::ViennaNGS::SpliceJunc uses third-party tools for computing intersections of BED files: bedtools intersect from the BEDtools suite is used to compute overlaps and bedtools sort is used to sort BED output files. Make sure that those third-party utilities are available on your system, and that they can be found and executed by the Perl interpreter. We recommend installing the latest version of BEDtools on your system.
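One way to verify up front that bedtools can be found is the can_run function from IPC::Cmd, which searches the PATH for an executable. The following is a minimal sketch; the helper name missing_tools is hypothetical, and the check may differ from whatever this module performs internally.

```perl
use strict;
use warnings;
use IPC::Cmd qw(can_run);

# Hypothetical helper: return the names of required tools
# that cannot be found in the current PATH.
sub missing_tools {
    my @tools = @_;
    return grep { !can_run($_) } @tools;
}

my @missing = missing_tools('bedtools');
if (@missing) {
    warn "Missing third-party tools: @missing\n";
}
else {
    print "bedtools found\n";
}
```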

SEE ALSO

Bio::ViennaNGS

AUTHOR

Michael T. Wolfinger <michael@wolfinger.eu>

COPYRIGHT AND LICENSE

Copyright (C) 2015-2017 Michael T. Wolfinger <michael@wolfinger.eu>

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.10.0 or, at your option, any later version of Perl 5 you may have available.

This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.