NAME

Proch::N50 - a small module to calculate N50 (plus total size and total number of sequences) for a FASTA or FASTQ file. It has no dependencies outside the Perl core.

VERSION

version 0.06

SYNOPSIS

  use Proch::N50 qw(getStats getN50);
  use Data::Dumper;    # needed for the Dump() call below
  my $filepath = '/path/to/assembly.fasta';

  # Get N50 only: getN50(file) will return an integer
  print "N50 only:\t", getN50($filepath), "\n";

  # Full stats
  my $seq_stats = getStats($filepath);
  print Data::Dumper->Dump( [ $seq_stats ], [ qw(*FASTA_stats) ] );
  # Will print:
  # %FASTA_stats = (
  #               'N50' => 65,
  #               'min' => 4,
  #               'max' => 65,
  #               'dirname' => 'data',
  #               'size' => 130,
  #               'seqs' => 6,
  #               'filename' => 'small_test.fa',
  #               'status' => 1
  #             );

  # Also get a JSON string (stored under the 'json' key)
  my $seq_stats_with_JSON = getStats($filepath, 'JSON');
  print $seq_stats_with_JSON->{json}, "\n";
  # Will print:
  # {
  #    "status" : 1,
  #    "seqs" : 6,
  #    <...>
  #    "filename" : "small_test.fa",
  #    "N50" : "65",
  # }

METHODS

getN50(filepath)

Returns the N50 of the given FASTA/FASTQ file, or 0 on error.

getStats(filepath, alsoJSON)

Calculates N50 and basic stats for <filepath>. If invoked with a second parameter, the result also includes a JSON string. This function returns a reference to a hash with the following keys:

size (int)

total number of bp in the file

N50 (int)

the actual N50

min (int)

Minimum length observed in FASTA/Q file

max (int)

Maximum length observed in FASTA/Q file

seqs (int)

total number of sequences in the file

filename (string)

file basename of the input file

dirname (string)

name of the directory containing the input file

json (string: JSON pretty printed)

(pretty printed) JSON string of the object (only if JSON is installed)
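
As a rough illustration of how the returned hash reference might be consumed, the sketch below uses sample data copied from the SYNOPSIS output rather than a real call to getStats(), so that it runs without an input file. The helper mean_length() is hypothetical and is not part of Proch::N50.

```perl
use strict;
use warnings;

# Sample data copied from the SYNOPSIS output; in real use this would be
# the hash reference returned by getStats($filepath).
my $seq_stats = {
    N50      => 65,
    min      => 4,
    max      => 65,
    size     => 130,
    seqs     => 6,
    dirname  => 'data',
    filename => 'small_test.fa',
    status   => 1,
};

# mean_length() is a hypothetical helper, not part of Proch::N50.
sub mean_length {
    my ($stats) = @_;
    return 0 unless $stats->{seqs};    # avoid division by zero
    return $stats->{size} / $stats->{seqs};
}

printf "%s/%s: %d sequences, %d bp, N50 %d, mean length %.2f\n",
    $seq_stats->{dirname}, $seq_stats->{filename},
    $seq_stats->{seqs},    $seq_stats->{size},
    $seq_stats->{N50},     mean_length($seq_stats);
```

Derived values such as the mean length are not reported by the module itself; they can be computed from the size and seqs keys as shown.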

jsonStats(filepath)

Returns the JSON string with basic stats (same as $result->{json} from getStats(filepath, 'JSON')). Requires JSON::PP to be installed.
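
The JSON string can be turned back into a Perl data structure with decode_json from JSON::PP. The following is a minimal sketch: the JSON text is a hypothetical, abridged literal shaped like the SYNOPSIS output, standing in for the value that jsonStats() would return.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical JSON text shaped like the SYNOPSIS output (fields abridged);
# in real use this would come from jsonStats($filepath).
my $json_text = '{"status":1,"seqs":6,"filename":"small_test.fa","N50":"65"}';

my $stats = decode_json($json_text);

# N50 appears quoted in the SYNOPSIS output; adding 0 forces a number.
my $n50 = $stats->{N50} + 0;

print "N50 of $stats->{filename}: $n50\n";
```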

_n50fromHash(hash, totalsize)

This is an internal helper subroutine that performs the actual N50 calculation, which is why it is included in the documentation. It expects a reference to a hash of sizes ($size{SIZE} = COUNT) and the total sum of sizes obtained by parsing the sequence file. Returns the N50, minimum and maximum lengths.
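
To make the calculation concrete, here is a minimal sketch of how an N50 can be derived from such a hash, using the common definition (the length at which the cumulative size, counted from the longest sequences down, reaches half the total). The subroutine n50_from_hash() is a hypothetical stand-in written for this example and is not the module's own code, which may differ in detail.

```perl
use strict;
use warnings;

# n50_from_hash() is a hypothetical stand-in, not the module's own code.
# %$sizes maps a sequence length to the number of sequences of that length;
# $total is the sum of all sequence lengths.
sub n50_from_hash {
    my ($sizes, $total) = @_;

    my @lengths = sort { $b <=> $a } keys %$sizes;    # longest first
    return unless @lengths;                           # no sequences

    my ($max, $min) = @lengths[0, -1];
    my $running = 0;

    for my $len (@lengths) {
        $running += $len * $sizes->{$len};
        # The first length at which the running sum reaches half the total.
        return ($len, $min, $max) if $running >= $total / 2;
    }
    return;
}

# Lengths 10, 8, 8 and 4: total 30, so half the total is 15.
my %sizes = (10 => 1, 8 => 2, 4 => 1);
my ($n50, $min, $max) = n50_from_hash(\%sizes, 30);

print "N50=$n50 min=$min max=$max\n";    # prints "N50=8 min=4 max=10"
```

Storing counts per length rather than every individual length keeps memory use proportional to the number of distinct lengths, which is presumably why the helper takes a hash instead of a list.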

DEPENDENCIES

JSON::PP (core)
Term::ANSIColor (optional) for the n50.pl script

AUTHOR

Andrea Telatin <andrea@telatin.com>, <andrea.telatin@quadram.ac.uk>, Quadram Institute Bioscience

COPYRIGHT AND LICENSE

This software is Copyright (c) 2019 by Andrea Telatin.

This is free software, licensed under:

  The MIT (X11) License

No warranty, explicit or implicit, is provided.