The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

n50 - A script to calculate N50 from one or multiple FASTA/FASTQ files.

VERSION

version 1.5.0

SYNOPSIS

  n50.pl [options] [FILE1 FILE2 FILE3...]

DESCRIPTION

This program parses a list of FASTA/FASTQ files calculating for each one the number of sequences, the sum of sequences lengths and the N50, N75, N90 and auN*. It will print the result in different formats, by default only the N50 is printed for a single file and all metrics in TSV format for multiple files.

*: See https://lh3.github.io/2020/04/08/a-new-metric-on-assembly-contiguity

PARAMETERS

-o, --sortby

Sort by field: 'N50' (default), 'min', 'max', 'seqs', 'size', 'path'. By default will be descending for numeric fields, ascending for 'path'. See -r, --reverse.

-r, --reverse

Reverse sort (see: -o);

-f, --format

Output format: default, tsv, json, custom, screen. See below for format specific switches. Specify "list" to list available formats.

-e

Also calculate a custom N{e} metric. Expecting an integer 0 < e < 100.

-s, --separator

Separator to be used in 'tsv' output. Default: tab. The 'tsv' format will print a header line, followed by a line for each file given as input with: file path, as received, total number of sequences, total size in bp, and finally N50.

-b, --basename

Instead of printing the path of each file, will only print the filename, stripping relative or absolute paths to it. See -a. Warning: if you are reading multiple files with the same basename, only one will be printed. This is the intended behaviour and you will only receive a warning.

-a, --abspath

Instead of printing the path of each file, as supplied by the user (can be relative), it will the absolute path. Will override -b (basename). See -b.

-u, --noheader

When used with 'tsv' output format, will suppress header line.

-n, --nonewline

If used with 'default' (or 'csv' output format), will NOT print the newline character after the N50 for a single file. Useful in bash scripting:

  n50=$(n50.pl filename);
-t, --template

String to be used with 'custom' format. Will be used as template string for each sample, replacing {new} with newlines, {tab} with tab and {N50}, {seqs}, {size}, {path} with sample's N50, number of sequences, total size in bp and file path respectively (the latter will respect --basename if used).

-q, --thousands-sep

Add the thousands separator in all the printed numbers. Enabled by default with --format screen (-x).

-p, --pretty

If used with 'json' output format, will format the JSON in pretty print mode. Example:

 {
   "file1.fa" : {
     "size" : 290,
     "N50"  : 290,
     "seqs" : 2
  },
   "file2.fa" : {
     "N50"  : 456,
     "size" : 456,
     "seqs" : 2
  }
 }
-h, --help

Will display this full help message and quit, even if other arguments are supplied.

Output formats

These are the values for --format.

tsv (tab separated values)
  #path       seqs  size  N50   min   max
  test2.fa    8     825   189   4     256
  reads.fa    5     247   100   6     102
  small.fa    6     130   65    4     65
csv (comma separated values)

Same as --format tsv and --separator ,:

  #path,seqs,size,N50,min,max
  test.fa,8,825,189,4,256
  reads.fa,5,247,100,6,102
  small_test.fa,6,130,65,4,65
screen (screen friendly)

Use -x as shortcut for --format screen. Enables --thousands-sep (-q) by default.

  .-----------------------------------------------------------------------------------------.
  | File          | Seqs | Total bp | N50    | min   | max    | N75   | N90   | auN         |
  +---------------+------+----------+--------+-------+--------+-------+-------+-------------+
  | big.fa        |    4 |   18,359 | 11,840 | 2,167 | 11,840 | 2,176 | 2,167 | 8923.21,984 |
  | sim1.fa       |   39 |   18,864 |    679 |    20 |    971 |   408 |   313 |  733.51,389 |
  | sim2.fa       |   21 |    7,530 |    493 |    68 |    989 |   330 |   174 |  575.47,012 |
  | test.fa       |    8 |      825 |    189 |     4 |    256 |   168 |   168 |  260.99,515 |
  '---------------+------+----------+--------+-------+--------+-------+-------+-------------'
json (JSON)

Use -j as shortcut for --format json.

    {
       "data/sim1.fa" : {
          "seqs" : 39,
          "N50" : 679,
          "max" : 971,
          "N90" : 313,
          "min" : 20,
          "size" : 18864,
          "auN" : 733.51389,
          "N75" : 408
       },
       "data/sim2.fa" : {
          "max" : 989,
          "seqs" : 21,
          "N50" : 493,
          "N90" : 174,
          "min" : 68,
          "auN" : 575.47012,
          "N75" : 330,
          "size" : 7530
       }
    }
custom

Will print the output using the template string provided with -t TEMPLATE. Fields are in the {field_name} format. {new}/{n}/\n is the newline, {tab}/{t}/\t is a tab. All the keys of the JSON object are valid fields: {seqs}, {N50}, {min}, {max}, {size}.

EXAMPLE USAGES

Screen friendly table (-x is a shortcut for --format screen), sorted by N50 descending (default):

  n50.pl -x files/*.fa

Screen friendly table, sorted by total contig length (--sortby max) ascending (--reverse):

  n50.pl -x -o max -r files/*.fa

Tabular (tsv) output is default:

  n50.pl -o max -r files/*.fa

A custom output format:

  n50.pl data/*.fa -f custom -t '{path}{tab}N50={N50};Sum={size}{new}'

CITING

Telatin A, Fariselli P, Birolo G. SeqFu: A Suite of Utilities for the Robust and Reproducible Manipulation of Sequence Files. Bioengineering 2021, 8, 59. https://doi.org/10.3390/bioengineering8050059

CONTRIBUTING, BUGS

The repository of this project is available at https://github.com/telatin/proch-n50/.

AUTHOR

Andrea Telatin <andrea@telatin.com>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2018-2022 by Andrea Telatin.

This is free software, licensed under:

  The MIT (X11) License