Andrea Telatin ⚗️
and 1 contributors

NAME

n50 - A script to calculate N50 from one or multiple FASTA/FASTQ files.

VERSION

version 0.80

SYNOPSIS

  n50.pl [options] [FILE1 FILE2 FILE3...]

DESCRIPTION

This program parses a list of FASTA/FASTQ files and calculates, for each one, the number of sequences, the sum of the sequence lengths, and the N50. The result can be printed in different formats: by default, only the N50 is printed for a single file, and all metrics are printed in TSV format for multiple files.
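The metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the script's own code: it assumes the common definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total size), and the helper names `fasta_lengths` and `summarize` are hypothetical. Only plain FASTA is handled here.

```python
import io


def fasta_lengths(handle):
    """Yield the length of each record in a FASTA-formatted text stream."""
    length = None  # None until the first header line is seen
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def summarize(lengths):
    """Return seqs, size, N50, min and max for a collection of lengths."""
    lengths = sorted(lengths, reverse=True)
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        # Assumption: N50 is reached when the running sum covers >= 50% of the total.
        if running * 2 >= total:
            n50 = length
            break
    return {
        "seqs": len(lengths),
        "size": total,
        "N50": n50,
        "min": lengths[-1] if lengths else 0,
        "max": lengths[0] if lengths else 0,
    }


# Three toy records of length 10, 6 and 3 (the second one spans two lines).
fasta = ">a\nACGTACGTAC\n>b\nACGT\nAC\n>c\nACG\n"
stats = summarize(fasta_lengths(io.StringIO(fasta)))
print(stats)
```

How the real script breaks ties at exactly 50% is not stated in this documentation, so results on edge cases may differ.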

PARAMETERS

-o, --sortby

Sort by field: 'N50' (default), 'min', 'max', 'seqs', 'size', 'path'. Numeric fields are sorted in descending order by default, 'path' in ascending order. See -r, --reverse.

-r, --reverse

Reverse the sort order (see -o).
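The interaction between -o and -r can be illustrated with a short Python sketch. It is not taken from the script: `sort_rows` and `NUMERIC_FIELDS` are hypothetical names, and the rows reuse the figures from the 'screen' example in this document.

```python
NUMERIC_FIELDS = {"N50", "min", "max", "seqs", "size"}


def sort_rows(rows, sortby="N50", reverse=False):
    """Sort per-file stats: numeric fields descending, 'path' ascending."""
    descending = sortby in NUMERIC_FIELDS
    if reverse:
        # --reverse flips whatever the default direction is for the field.
        descending = not descending
    return sorted(rows, key=lambda row: row[sortby], reverse=descending)


rows = [
    {"path": "small_test.fa", "seqs": 6, "size": 130, "N50": 65, "min": 4, "max": 65},
    {"path": "rdp_16s_v16.fa", "seqs": 13212, "size": 19098167, "N50": 1467, "min": 320, "max": 2210},
    {"path": "test_fasta_grep.fa", "seqs": 1, "size": 80, "N50": 80, "min": 80, "max": 80},
]

for row in sort_rows(rows, sortby="max", reverse=True):
    print(row["path"], row["max"])
```

With `sortby="max"` and `reverse=True` the smallest maximum length comes first, which mirrors `-o max -r`.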

-f, --format

Output format: default, tsv, csv, json, custom, screen. See below for format-specific switches. Specify "list" to list available formats.

-s, --separator

Separator to be used in 'tsv' output. Default: tab. The 'tsv' format prints a header line, followed by one line per input file with: the file path (as received), the total number of sequences, the total size in bp, and the N50. The 'tsv' example under "Output formats" also shows the minimum and maximum sequence lengths as the last two columns.
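For downstream processing, the tabular output can be read back with Python's standard csv module. The sketch below is an illustration only: `parse_table` is a hypothetical helper, the sample text is copied from the 'tsv' example in this document, and it assumes the header starts with '#' and that --thousands-sep is not in use.

```python
import csv
import io

# Sample copied from the 'tsv' example in this document.
tsv_output = (
    "#path\tseqs\tsize\tN50\tmin\tmax\n"
    "test2.fa\t8\t825\t189\t4\t256\n"
    "reads.fa\t5\t247\t100\t6\t102\n"
    "small.fa\t6\t130\t65\t4\t65\n"
)


def parse_table(text, separator="\t"):
    """Parse tsv/csv-style output into a list of dicts with integer metrics."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    header = next(reader)
    header[0] = header[0].lstrip("#")  # '#path' -> 'path'
    rows = []
    for fields in reader:
        row = dict(zip(header, fields))
        for key in header[1:]:
            row[key] = int(row[key])  # every column after the path is numeric
        rows.append(row)
    return rows


rows = parse_table(tsv_output)
print(rows[0]["path"], rows[0]["N50"])
```

Passing `separator=","` handles output produced with a comma separator in the same way.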

-b, --basename

Instead of printing the path of each file, print only the filename, stripping any relative or absolute path. See -a. Warning: if you read multiple files with the same basename, only one will be printed. This is the intended behaviour, and you will only receive a warning.
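Because files sharing a basename collapse into a single entry, it can be worth checking a file list before using -b. The following Python sketch is a hypothetical pre-flight check (the function name `find_basename_collisions` and the example paths are invented for illustration); it does not reproduce the script's own warning logic.

```python
import os
from collections import defaultdict


def find_basename_collisions(paths):
    """Return {basename: [paths...]} for basenames shared by several paths."""
    groups = defaultdict(list)
    for path in paths:
        groups[os.path.basename(path)].append(path)
    return {name: group for name, group in groups.items() if len(group) > 1}


# Hypothetical input list: two different files named contigs.fa.
paths = ["run1/contigs.fa", "run2/contigs.fa", "ref/genome.fa"]
collisions = find_basename_collisions(paths)
for name, group in collisions.items():
    print(f"{name} is shared by: {', '.join(group)}")
```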

-a, --abspath

Instead of printing the path of each file as supplied by the user (which can be relative), print the absolute path. Overrides -b (basename). See -b.

-u, --noheader

When used with 'tsv' output format, will suppress header line.

-n, --nonewline

If used with the 'default' (or 'csv') output format, will NOT print the newline character after the N50 for a single file. Useful in bash scripting:

  n50=$(n50.pl filename);

-t, --template

String to be used with 'custom' format. Will be used as template string for each sample, replacing {new} with newlines, {tab} with tab and {N50}, {seqs}, {size}, {path} with sample's N50, number of sequences, total size in bp and file path respectively (the latter will respect --basename if used).

-q, --thousands-sep

Add the thousands separator in all the printed numbers. Enabled by default with --format screen (-x).
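As an illustration of the effect, integer grouping can be written in Python as below. This is a sketch only: `with_thousands_sep` is a hypothetical name, and the comma is assumed as the separator because that is what the 'screen' example in this document shows.

```python
def with_thousands_sep(number, sep=","):
    """Format an integer with a separator between groups of three digits."""
    return f"{number:,}".replace(",", sep)


for value in (80, 13212, 19098167):
    print(with_thousands_sep(value))
```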

-p, --pretty

If used with 'json' output format, will format the JSON in pretty print mode. Example:

 {
   "file1.fa" : {
     "size" : 290,
     "N50"  : 290,
     "seqs" : 2
   },
   "file2.fa" : {
     "N50"  : 456,
     "size" : 456,
     "seqs" : 2
   }
 }

-h, --help

Will display this full help message and quit, even if other arguments are supplied.

Output formats

These are the values for --format.

tsv (tab separated values)

  #path       seqs  size  N50   min   max
  test2.fa    8     825   189   4     256
  reads.fa    5     247   100   6     102
  small.fa    6     130   65    4     65

csv (comma separated values)

Same as --format tsv with --separator ',':

  #path,seqs,size,N50,min,max
  test.fa,8,825,189,4,256
  reads.fa,5,247,100,6,102
  small_test.fa,6,130,65,4,65

screen (screen friendly)

Use -x as shortcut for --format screen. Enables --thousands-sep (-q) by default.

  .----------------------------------------------------------------.
  | File               | Seqs   | Total bp   | N50   | min | max   |
  +--------------------+--------+------------+-------+-----+-------+
  | test_fasta_grep.fa |      1 |         80 |    80 |  80 |    80 |
  | small_test.fa      |      6 |        130 |    65 |   4 |    65 |
  | rdp_16s_v16.fa     | 13,212 | 19,098,167 | 1,467 | 320 | 2,210 |
  '--------------------+--------+------------+-------+-----+-------'

json (JSON)

Use -j as shortcut for --format json.

  {
    "small_test.fa" : {
       "max"  : 65,
       "N50"  : 65,
       "seqs" : 6,
       "size" : 130,
       "min"  : 4
    },
    "rdp_16s_v16.fa" : {
       "seqs" : 13212,
       "N50"  : 1467,
       "max"  : 2210,
       "min"  : 320,
       "size" : 19098167
    }
  }

custom

Will print the output using the template string provided with -t TEMPLATE. Fields use the {field_name} format. {new}, {n} or \n is a newline; {tab}, {t} or \t is a tab. All the keys of the JSON object are valid fields ({seqs}, {N50}, {min}, {max}, {size}), as is {path}.
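The placeholder substitution described above could work roughly as in the following Python sketch. It is an approximation written for illustration, not the script's implementation: `render` is a hypothetical function, and the handling of unknown placeholders (left untouched here) is an assumption.

```python
import re


def render(template, stats):
    """Expand {field}, {new}/{n}/\\n and {tab}/{t}/\\t in a template string."""
    # Literal backslash escapes typed on the command line come first,
    # so that substituted values (e.g. paths) are never re-interpreted.
    text = template.replace("\\n", "\n").replace("\\t", "\t")
    mapping = {"new": "\n", "n": "\n", "tab": "\t", "t": "\t"}
    mapping.update({key: str(value) for key, value in stats.items()})
    # Assumption: placeholders that are not recognised are left as they are.
    return re.sub(r"\{(\w+)\}", lambda m: mapping.get(m.group(1), m.group(0)), text)


stats = {"path": "small_test.fa", "seqs": 6, "size": 130, "N50": 65, "min": 4, "max": 65}
line = render("{path}{tab}N50={N50};Sum={size}{new}", stats)
print(line, end="")
```

The template used here is the one from the custom-format example under EXAMPLE USAGES.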

EXAMPLE USAGES

Screen friendly table (-x is a shortcut for --format screen), sorted by N50 descending (default):

  n50.pl -x files/*.fa

Screen friendly table, sorted by maximum contig length (--sortby max) ascending (--reverse):

  n50.pl -x -o max -r files/*.fa

Tabular (tsv) output is the default for multiple files:

  n50.pl -o max -r files/*.fa

A custom output format:

  n50.pl data/*.fa -f custom -t '{path}{tab}N50={N50};Sum={size}{new}'

COPYRIGHT

Copyright (C) 2017-2019 Andrea Telatin

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

AUTHOR

Andrea Telatin <andrea@telatin.com>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2019 by Andrea Telatin.

This is free software, licensed under:

  The MIT (X11) License