NAME

CracTools::Utils - A set of useful functions

VERSION

version 1.251

SYNOPSIS

  # Reverse complementing a sequence
  my $seq = reverseComplement("ATGC");

  # Reading a FASTQ file
  my $it = seqFileIterator('file.fastq','fastq');
  while(my $entry = $it->()) {
    print "Sequence name   : $entry->{name}
           Sequence        : $entry->{seq}
           Sequence quality: $entry->{qual}","\n";
  }

  # Reading paired-end files more easily
  my $it = pairedEndSeqFileIterator($reads1,$reads2,$format);
  while (my $entry = $it->()) {
    print "Read_1 : $entry->{read1}->{seq}
           Read_2 : $entry->{read2}->{seq}";
  }

  # Parsing a GFF file
  my $it = gffFileIterator($file);
  while (my $annot = $it->()) {
    print "chr    : $annot->{chr}
           start  : $annot->{start}
           end    : $annot->{end}";
  }

DESCRIPTION

Bio::Lite is a set of subroutines that aims to answer the same kinds of questions as the Bio-Perl distribution, in a FAST and SIMPLE way.

Bio::Lite does not use complex data structures or objects, which would slow down execution.

All methods can be imported with a single "use Bio::Lite".

Bio::Lite is a lightweight single module with NO DEPENDENCIES.

UTILS

reverseComplement

Reverse-complement the nucleotide sequence given as argument.

Example:

  my $seq_revcomp = reverseComplement($seq);

reverseComplement is more than 100x faster than Bio-Perl revcom_as_string()
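
As an illustration of the idea only (not necessarily how the module implements it), a reverse complement can be computed in pure Perl by reversing the string and translating each base with tr///. The name revcomp_sketch is hypothetical, and the handling of lowercase bases and of N is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CracTools::Utils): reverse the
# sequence, then swap each base for its complement. Characters other
# than A, C, G, T (either case), such as N, are left unchanged.
sub revcomp_sketch {
    my ($seq) = @_;
    my $rc = reverse $seq;          # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;   # complement each base
    return $rc;
}

print revcomp_sketch("ATGC"), "\n";   # prints GCAT
```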

reverse_tab

  Arg [1] : String - a string of values separated by commas.
  Example : $reverse = reverse_tab('2,1,1,1,0,0,1');
  Description : Reverse the values of the string in argument.
                For example : reverse_tab('1,2,0,1') returns : '1,0,2,1'.
  ReturnType  : String
  Exceptions  : none
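
The behaviour described above can be sketched in one line of plain Perl. This is a hedged equivalent, assuming values never contain a comma; reverse_tab_sketch is a hypothetical name, not the module's code.

```perl
use strict;
use warnings;

# Hypothetical equivalent of the behaviour described above:
# split on commas, reverse the list, and join it back.
sub reverse_tab_sketch {
    my ($string) = @_;
    return join ',', reverse split /,/, $string;
}

print reverse_tab_sketch('1,2,0,1'), "\n";   # prints 1,0,2,1
```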

isVersionGreaterOrEqual($v1,$v2)

Return true if version number v1 is greater than or equal to v2.

convertStrand

Convert strand from '+/-' standard to '1/-1' standard and the opposite.

Example:

  say "Forward a: ",convertStrand('+');
  say "Forward b: ",convertStrand(1);
  say "Reverse a: ",convertStrand('-');
  say "Reverse b: ",convertStrand(-1);

will print

  Forward a: 1
  Forward b: +
  Reverse a: -1
  Reverse b: -
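
A two-way conversion like this can be sketched with a single lookup table. The names below are hypothetical, and returning undef for unrecognised input is an assumption, not documented behaviour of convertStrand.

```perl
use strict;
use warnings;

# Hypothetical lookup table covering both directions of the conversion.
my %STRAND_MAP = ('+' => 1, '-' => -1, 1 => '+', -1 => '-');

# Returns undef for any value that is not a recognised strand
# (an assumption made for this sketch).
sub convert_strand_sketch {
    my ($strand) = @_;
    return $STRAND_MAP{$strand};
}

print "Forward a: ", convert_strand_sketch('+'), "\n";
print "Forward b: ", convert_strand_sketch(1),   "\n";
print "Reverse a: ", convert_strand_sketch('-'), "\n";
print "Reverse b: ", convert_strand_sketch(-1),  "\n";
```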

removeChrPrefix

Remove the "chr" prefix from a given string

Example:

  say "reference name: ",removeChrPrefix("chr1");

will print

  reference name: 1

addChrPrefix

Add the "chr" prefix to the given string
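
Both prefix helpers amount to small string operations. The sketch below uses hypothetical names; leaving an already-prefixed name untouched in the "add" direction is an assumption of this sketch and may differ from what addChrPrefix does.

```perl
use strict;
use warnings;

# Hypothetical: strip a leading "chr" if present.
sub remove_chr_prefix_sketch {
    my ($name) = @_;
    $name =~ s/^chr//;
    return $name;
}

# Hypothetical: add "chr" unless the name already starts with it
# (this guard is an assumption, not documented module behaviour).
sub add_chr_prefix_sketch {
    my ($name) = @_;
    return $name =~ /^chr/ ? $name : "chr$name";
}

print "reference name: ", remove_chr_prefix_sketch("chr1"), "\n";   # prints reference name: 1
print "reference name: ", add_chr_prefix_sketch("1"), "\n";         # prints reference name: chr1
```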

ENCODING

encodePosListToBase64

Encode a (0-based) list of increasing positions as a string, using the Base64 alphabet: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/

  my $encoded_list = CracTools::Utils::encodePosListToBase64(1,3,5,8,12,32);
  my @decoded_list = CracTools::Utils::decodePosListInBase64($encoded_list);

decodePosListInBase64

Decode position list encoded by encodePosListToBase64.

PARSING

These tools read common bioinformatics file formats, such as:

Sequence files : FASTA, FASTQ
Annotation files : GFF3, GTF2, BED6, BED12, ...
Alignment files : SAM, BAM

seqFileIterator

Open FASTA or FASTQ files (optionally gzipped). seqFileIterator detects the format from the file extension, but you can force it with a second parameter: 'fasta' or 'fastq'.

Example:

  my $it = seqFileIterator('file.fastq','fastq');
  while(my $entry = $it->()) {
    print "Sequence name   : $entry->{name}
           Sequence        : $entry->{seq}
           Sequence quality: $entry->{qual}","\n";
  }

Return: HashRef

  { name => 'sequence_identifier',
    seq  => 'sequence_value',
    qual => 'sequence_quality', # only defined for FASTQ files
  }

seqFileIterator is more than 50x faster than Bio-Perl Bio::SeqIO for FASTQ files.

seqFileIterator is 4x faster than Bio-Perl Bio::SeqIO for FASTA files.
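
The iterator style used throughout this module (a code reference that returns one hashref per call and a false value at end of file) can be sketched as a closure over a filehandle. This is an illustration only: fastq_iterator_sketch is a hypothetical name, it handles FASTQ only, and it assumes strict four-line records without gzip support or format detection.

```perl
use strict;
use warnings;

# Hypothetical FASTQ-only iterator built as a closure over a filehandle.
# Assumes strict four-line records (no wrapped sequences).
sub fastq_iterator_sketch {
    my ($fh) = @_;
    return sub {
        my $header = <$fh>;
        return unless defined $header;   # end of file
        my $seq  = <$fh>;
        my $plus = <$fh>;                # separator line, ignored
        my $qual = <$fh>;
        return unless defined $qual;     # truncated record
        chomp($header, $seq, $qual);
        $header =~ s/^\@//;              # drop the leading '@'
        return { name => $header, seq => $seq, qual => $qual };
    };
}

# Read from an in-memory string so the example needs no file on disk.
my $fastq = "\@read1\nATGC\n+\nIIII\n\@read2\nGGCC\n+\nHHHH\n";
open my $fh, '<', \$fastq or die "Cannot open in-memory FASTQ: $!";
my $it = fastq_iterator_sketch($fh);
while (my $entry = $it->()) {
    print "$entry->{name}\t$entry->{seq}\t$entry->{qual}\n";
}
close $fh;
```

Keeping the filehandle inside the closure means the caller holds only the iterator and never has to track the read position.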

pairedEndSeqFileIterator

Open paired-end sequence files using seqFileIterator().

Paired-end files are generated by Next Generation Sequencing technologies (like Illumina), where two reads are sequenced from the same DNA fragment and saved in separate files.

Example:

  my $it = pairedEndSeqFileIterator($reads1,$reads2,$format);
  while (my $entry = $it->()) {
    print "Read_1 : $entry->{read1}->{seq}
           Read_2 : $entry->{read2}->{seq}";
  }

Return: HashRef

  { read1 => 'see seqFileIterator() return',
    read2 => 'see seqFileIterator() return'
  }

pairedEndSeqFileIterator has no equivalent in Bio-Perl

writeSeq

  CracTools::Utils::writeSeq($filehandle,$format,$seq_name,$seq,$seq_qual)

Write the sequence in the output stream with the specified format.

bedFileIterator

Manage the BED file format.

Example:

  my $it = bedFileIterator($file);
  while (my $annot = $it->()) {
    print "chr    : $annot->{chr}
           start  : $annot->{start}
           end    : $annot->{end}";
  }

Return a hashref with the annotation parsed:

  { chr         => 'field_1',
    start       => 'field_2',
    end         => 'field_3',
    name        => 'field_4',
    score       => 'field_5',
    strand      => 'field_6',
    thick_start => 'field_7',
    thick_end   => 'field_8',
    rgb         => 'field_9',
    blocks      => [ {'size' => 'block size',
                      'start' => 'block start',
                      'end'   => 'block start + block_size',
                      'ref_start' => 'block start on the reference',
                      'ref_end'   => 'block end on the reference'}, ... ],
    seek_pos    => 'Seek position of this line in the file',
  }

gffFileIterator

Manage the GFF3 and GTF2 file formats.

Example:

  my $it = gffFileIterator($file,'type');
  while (my $annot = $it->()) {
    print "chr    : $annot->{chr}
           start  : $annot->{start}
           end    : $annot->{end}";
  }

Return a hashref with the annotation parsed:

  { chr         => 'field_1',
    source      => 'field_2',
    feature     => 'field_3',
    start       => 'field_4',
    end         => 'field_5',
    score       => 'field_6',
    strand      => 'field_7',
    frame       => 'field_8',
    attributes  => { 'attribute_id' => 'attribute_value', ...},
    seek_pos    => 'Seek position of this line in the file',
  }

gffFileIterator is 5x faster than Bio-Perl Bio::Tools::GFF

vcfFileIterator

Manage the VCF file format.

Return a hashref with the annotation parsed:

  { chr     => $chr,
    pos     => $pos,
    id      => $id,
    ref     => $ref,
    alt     => [ alt1, alt2, ...],
    qual    => $qual,
    filter  => $filter,
    info    => { AS => value,
                 DP => value,
                 ...
               },
  }

chimCTFileIterator

Return a hashref with the chimera parsed:

  {
    sample            => $sample,
    chim_key          => $chim_key,
    name              => $name,
    chr1              => $chr1,
    pos1              => $pos1,
    strand1           => $strand1,
    chr2              => $chr2,
    pos2              => $pos2,
    strand2           => $strand2,
    chim_value        => $chim_value,
    spanning_junction => $spanning_junction,
    spanning_PE       => $spanning_PE,
    class             => $class,
    comments          => { comment_id => 'comment_value', ... },
    extended_fields   => { extended_field_id => 'extended_field_value', ... },
  }

bamFileIterator

BE AWARE: this method is only available if the samtools binary is available.

Return an iterator over a BAM file using a samtools view pipe.

A region can be passed as a parameter to restrict the results. In this case, the BAM file must be indexed.

Example:

  my $fh = bamFileIterator("file.bam","17:43,971,748-44,105,700");
  while(my $line = <$fh>) {
    my $parsed_line = CracTools::SAMReader::SAMline->new($line);
    # do some stuff
  }

See also CracTools::SAMReader::SAMline if you need to parse SAM lines easily.

getSeqFromIndexedRef

BE AWARE: this method is only available if the samtools binary is available.

Return a sequence from a given region of an indexed FASTA file.

Example:

  my $fasta_seq = getSeqFromIndexedRef("file.fa","chr2",29012,10);
  my $seq       = getSeqFromIndexedRef("file.fa","chr2",29012,10,'raw');

PARSING LINES

parseBedLine

parseGFFLine

parseVCFLine

parseChimCTLine

parseSAMLineLite

parseCigarChain

Given a CIGAR chain (see SAM specification), return a parsed version as an Array ref of cigar elements represented as { nb => 10, op => 'M' }.
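
A minimal sketch of such a parser, assuming a well-formed CIGAR string that uses the operation letters of the SAM specification (MIDNSHP=X). The name parse_cigar_sketch is hypothetical and this is not the module's own code; only the { nb => ..., op => ... } element shape is taken from the description above.

```perl
use strict;
use warnings;

# Hypothetical parser: each CIGAR element is a length followed by one
# operation letter from the SAM specification (MIDNSHP=X).
sub parse_cigar_sketch {
    my ($cigar) = @_;
    my @elements;
    while ($cigar =~ /(\d+)([MIDNSHP=X])/g) {
        push @elements, { nb => $1, op => $2 };
    }
    return \@elements;
}

my $parsed = parse_cigar_sketch('10M2I5M');
print join(' ', map { "$_->{nb}$_->{op}" } @$parsed), "\n";   # prints 10M 2I 5M
```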

FILES IO

getFileIterator

Generic method to parse files.

getReadingFileHandle

Return a file handle for reading the file given as argument. Display an error if the file cannot be opened, and handle gzipped files (detected from the .gz file extension).

Example:

  my $fh = getReadingFileHandle('file.txt.gz');
  while(<$fh>) {
    print $_;
  }
  close $fh;

getWritingFileHandle

Return a file handle for writing to the file given as argument. Display an error if the file cannot be opened, and handle gzipped files (detected from the .gz file extension).

Example:

  my $fh = getWritingFileHandle('file.txt.gz');
  print $fh "Hello world\n";
  close $fh;

getLineFromSeekPos

  getLineFromSeekPos($filehandle,$seek_pos);

Return the chomped line found at the given seek position.
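
A value such as the seek_pos field returned by the file iterators can be used to jump straight back to a line. The sketch below shows the general technique with Perl's built-in seek; line_from_seek_pos_sketch is a hypothetical name, and treating the position as a byte offset from the start of the file is an assumption.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical equivalent: jump to a byte offset, read one line, chomp it.
sub line_from_seek_pos_sketch {
    my ($fh, $seek_pos) = @_;
    seek($fh, $seek_pos, SEEK_SET) or die "Cannot seek to $seek_pos: $!";
    my $line = <$fh>;
    chomp $line if defined $line;
    return $line;
}

# In-memory file: the first line occupies bytes 0-10 (including its
# newline), so the second line starts at byte 11.
my $content = "chr1\t10\t20\nchr2\t30\t40\n";
open my $fh, '<', \$content or die "Cannot open in-memory file: $!";
print line_from_seek_pos_sketch($fh, 11), "\n";   # prints the chr2 line
close $fh;
```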

AUTHORS

  • Nicolas PHILIPPE <nphilippe.research@gmail.com>

  • Jérôme AUDOUX <jaudoux@cpan.org>

  • Sacha BEAUMEUNIER <sacha.beaumeunier@gmail.com>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2017 by IRMB/INSERM (Institute for Regenerative Medecine and Biotherapy / Institut National de la Santé et de la Recherche Médicale) and AxLR/SATT (Lanquedoc Roussilon / Societe d'Acceleration de Transfert de Technologie).

This is free software, licensed under:

  The GNU Affero General Public License, Version 3, November 2007