NAME

bioseq - FASTA sequence utility based on Bio::Perl

SYNOPSIS

bioseq options [input_file]

bioseq [-h | --help | -V | --version | --man]

bioseq is a command-line utility for common, routine sequence manipulations. Most methods are wrappers for Bio::Perl modules: Bio::Seq, Bio::SeqIO, Bio::SeqUtils, and Bio::Tools::SeqStats.

By default, bioseq assumes that both the input and the output files are in FASTA format, to facilitate the chaining (by UNIX pipes) of multiple bioseq runs.

Methods that are currently not wrappers should ideally be factored into individual Bio::Perl modules, which are better tested and handle exceptions better than stand-alone code in the Bio::BPWrapper package. As a design principle, command-line scripts here should consist of only wrapper calls.

Options

--composition, -c <input_file>

Base or amino acid (AA) composition. Wraps Bio::Tools::SeqStats->count_monomers().
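As an illustration of what a monomer count reports, here is a minimal Python sketch. It is not bioseq's implementation (which wraps the Perl module named above); the function name `composition` and the choice to fold case are assumptions made for this example.

```python
from collections import Counter


def composition(seq):
    """Return a {monomer: count} mapping for one sequence.

    Hypothetical helper: upper-casing the input is an assumption of
    this sketch, not documented bioseq behaviour.
    """
    return dict(Counter(seq.upper()))
```

For a DNA sequence the keys are bases; for a protein sequence they are amino-acid letters.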

--delete, -d 'tag:value' <input_file>

Delete a sequence or a comma-separated list of sequences, e.g.,

   --delete id:foo       # by id
   --delete order:2      # by order
   --delete length:n     # by min length, where 'n' is length
   --delete ambig:x      # by min % ambiguous base/aa, where 'x' is the %
   --delete id:foo,bar   # list by id
   --delete re:REGEX     # using a regular expression (only one regex is expected)

--fetch, -f <genbank_accession>

Retrieves a sequence from GenBank using the provided accession number. Wraps Bio::DB::GenBank->get_Seq_by_acc().

--nogaps, -g <input_file>

Remove gaps

--input, -i 'format' <input_file>

Input file format. By default, this is 'fasta'. For Genbank format, use 'genbank'. For EMBL format, use 'embl'. Wraps Bio::SeqIO.

--length, -l <input_file>

Print all sequence lengths. Wraps Bio::Seq->length.

--numseq, -n

Print number of sequences.

--output, -o 'format' <input_file>

Output file format. By default, this is 'fasta'. For Genbank format, use 'genbank'. For EMBL format, use 'embl'. Wraps Bio::SeqIO.

--pick, -p

Select a single sequence:

   --pick 'id:foo'        by id
   --pick 'order:2'       by order
   --pick 're:REGEX'      using a regular expression

Select a list of sequences:

   --pick 'id:foo,bar'    list by id
   --pick 'order:2,3'     list by order
   --pick 'order:2-10'    list by range

Usage: bioseq -p 'tag:value' <input_file>
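The 'tag:value' selectors above can be read as a small grammar. The following Python sketch shows one way such selectors might be interpreted; it is not bioseq's code. The names `parse_selector` and `pick` are hypothetical, and two behaviours are assumptions of the sketch: ranges such as 'order:2-10' are treated as inclusive, and the regular expression is matched against the sequence ID only.

```python
import re


def parse_selector(spec):
    """Split a 'tag:value' selector into (tag, list_of_values)."""
    tag, _, value = spec.partition(":")  # split on the first colon only
    if tag == "id":
        return tag, value.split(",")
    if tag == "order":
        m = re.fullmatch(r"(\d+)-(\d+)", value)
        if m:  # a range such as '2-10' (assumed inclusive)
            return tag, list(range(int(m.group(1)), int(m.group(2)) + 1))
        return tag, [int(v) for v in value.split(",")]
    if tag == "re":
        return tag, [re.compile(value)]
    raise ValueError(f"unknown tag: {tag!r}")


def pick(records, spec):
    """Select (id, sequence) tuples from `records` according to `spec`."""
    tag, values = parse_selector(spec)
    if tag == "id":
        return [r for r in records if r[0] in values]
    if tag == "order":  # positions are 1-based, as in 'order:2'
        return [r for i, r in enumerate(records, start=1) if i in values]
    # tag == "re": assumed to match against the ID
    return [r for r in records if values[0].search(r[0])]
```

The same selector syntax is used by --delete, which would keep the complement of the picked set.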

--revcom | -r <input_file>

Reverse complement. Wraps Bio::Seq->revcom().
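For readers unfamiliar with the operation, a minimal Python sketch of reverse complementation follows. It is an illustration only: the helper name `revcom` is borrowed from the option name, and this sketch handles just the four unambiguous bases, leaving any other character unchanged.

```python
# Translation table for the four unambiguous DNA bases, both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcom(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]
```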

--subseq | -s 'beginning_index, ending_index' <input_file>

Select substring (of the 1st sequence). Wraps Bio::Seq->subseq(). For example:

   bioseq --subseq 20,80 <input_file>
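Bio::Seq->subseq() uses 1-based, inclusive coordinates, which differs from the 0-based, half-open slices of many languages. The Python sketch below shows the conversion; the function name `subseq` and the bounds check are choices made for this example, not bioseq's code.

```python
def subseq(seq, start, end):
    """Return residues start..end, counting from 1 and including both ends."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("coordinates must satisfy 1 <= start <= end <= length")
    return seq[start - 1:end]
```

With these coordinates, --subseq 20,80 would yield 61 residues.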

--translate | -t [1|3|6] <input_file>

Translate in 1, 3, or 6 frames, e.g., -t1, -t3, or -t6. Wraps Bio::Seq->translate(), Bio::SeqUtils->translate_3frames(), and Bio::SeqUtils->translate_6frames().

--restrict | -x 'RE' <dna_fasta_file>

Predicted fragments from digestion by a specified restriction enzyme. An input file with a single sequence is expected. Wraps Bio::Restriction::Analysis->cut().

--anonymize | -A 'number' <input_file>

Replace sequence IDs with serial IDs n characters long, where n is the given number. Each serial ID begins with a leading 'S'.

For example, with the option --anonymize '5', the first ID will be S0001.

A sed script file with a .sed suffix is also written; it may be used with sed's -f argument. If the filename is -, the sed file is named STDOUT.sed instead. A message containing the sed filename is written to STDERR.
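The serial-ID scheme can be sketched in a few lines of Python. This is an illustration inferred from the S0001 example above, not bioseq's code; the function name `serial_ids` is hypothetical, and the sketch assumes the width counts the leading 'S' plus zero-padded digits.

```python
def serial_ids(count, width):
    """Generate `count` IDs, each `width` characters long including the 'S'."""
    digits = width - 1  # one character is taken by the leading 'S'
    return [f"S{i:0{digits}d}" for i in range(1, count + 1)]
```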

--break | -B <input_file>

Break into individual sequences, writing a FASTA file for each sequence.

--count-codons | -C <input_file>

Count codons for coding sequences (e.g., a genome file consisting of CDS sequences). Wraps Bio::Tools::SeqStats->count_codons().

--feat2fas | -F

Extract gene sequences in FASTA format from a GenBank file of a bacterial genome. This does not work for a eukaryotic GenBank file. For example:

   bioseq --input genbank --feat2fas <genbank_file>

--leadgaps | -G <input_file>

Count and return the number of leading gaps in each sequence.
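A minimal Python sketch of counting leading gaps follows. It assumes the gap character is '-', which this page does not state; the function name `leading_gaps` is hypothetical and this is not bioseq's implementation.

```python
def leading_gaps(seq, gap_char="-"):
    """Count gap characters before the first non-gap residue."""
    return len(seq) - len(seq.lstrip(gap_char))
```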

--hydroB, -H

Return the mean Kyte-Doolittle hydropathicity for protein sequences. Wraps Bio::Tools::SeqStats->hydropathicity().

--linearize, -L <input_file>

Linearize FASTA, one sequence per line.

--reloop, -R

Re-circularize a bacterial genome by starting at a specified position. For example, for the sequence "ABCDE", bioseq -R'2' ... would generate "BCDEA".

 bioseq --reloop 'number' <input_file>
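The "ABCDE" example corresponds to rotating the sequence so that a given 1-based position becomes the new start. A minimal Python sketch, with the hypothetical function name `reloop` and an added bounds check, is shown here as an illustration rather than as bioseq's code.

```python
def reloop(seq, start):
    """Rotate a circular sequence so that 1-based position `start` comes first."""
    if not 1 <= start <= len(seq):
        raise ValueError("start must be between 1 and the sequence length")
    i = start - 1  # convert to a 0-based index
    return seq[i:] + seq[:i]
```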

--removestop, -X

Remove stop codons (e.g., for PAML input):

   bioseq --removestop <input_file>

--split-cdhit

Common Options

--help, -h

Print a brief help message and exit.

--man (but not "-m")

Print the manual page and exit.

--version, -V

Print current release version of this command and exit.

EXAMPLES

FASTA descriptors

 bioseq --length fasta_file       # lengths of sequences
 bioseq --numseq fasta_file       # number of sequences
 bioseq --composition fasta_file  # base or aa composition of sequences

FASTA filters

These take a FASTA-format file as input and output one or more FASTA-format files.

Multiple FASTA-file output

 bioseq --revcom fasta_file          # reverse-complement sequences
 bioseq --pick 'order:3' fasta_file  # pick the 3rd sequence
 bioseq --pick 're:B31' fasta_file   # pick sequences with regex
 bioseq --delete order:3 fasta_file  # delete the 3rd sequence
 bioseq --delete re:B31 fasta_file   # delete sequences with regex
 bioseq --translate 1 dna_fasta      # translate in 1st reading frame
 bioseq --translate 3 dna_fasta      # translate in 3 reading frames
 bioseq --translate 6 dna_fasta      # translate in 6 reading frames
 bioseq --nogaps fasta_file          # remove gaps
 bioseq --anonymize fasta_file       # Anonymize sequence IDs

Single FASTA-file output

 bioseq --subseq 1,10 fasta_file       # subsequence from positions 1-10
 bioseq --reloop 10 bac_genome_fasta   # re-circularize a genome at position 10

 # Retrieve sequence from database
 bioseq --fetch X83553 --output genbank  # fetch a genbank file by accession
 bioseq --fetch X83553 --output fasta    # fetch a GenBank record in FASTA format

 # Less common usages
 bioseq --linearize fasta_file    # Linearize FASTA: one sequence per line
 bioseq --break fasta_file        # Break into single-seq files
 bioseq --count-codons cds_fasta  # Codon counts (for coding sequences)
 bioseq --hydroB pep_fasta        # Hydrophobicity score (for protein seq)
 bioseq --input genbank --feat2fas file.gb  # extract genbank features to FASTA
 bioseq --restrict EcoRI dna_fasta            # Fragments from restriction digest

Examples involving Unix pipes

 bioseq --pick id:B31 dna_fasta | bioseq --nogaps | bioseq --translate 1 # pick a seq, remove gaps, & translate
 bioseq --pick order:2 dna_fasta | bioseq -r | bioseq --subseq 10,20    # pick the 2nd seq, rev-com it, & subseq

SEE ALSO

CONTRIBUTORS

  • Yözen Hernández <yzhernand at gmail dot com>

  • Girish Ramrattan <gramratt at gmail dot com>

  • Levy Vargas <levy dot vargas at gmail dot com>

  • Weigang Qiu (Maintainer)

  • Rocky Bernstein