bioseq - Manipulation of FASTA sequences based on BioPerl
bioseq [options] input_file
bioseq [-h | --help | -V | --version | --man]
bioseq -l fasta_file # [l]engths of sequences bioseq -n fasta_file # [n]umber of sequences bioseq -c fasta_file # base or aa [c]omposition
bioseq -r fasta_file # [r]everse-complement sequences bioseq -p 'order:3' fasta_file # pick the 3rd sequences bioseq -p 're:B31' fasta_file # pick sequences with regex bioseq -d 'order:3' fasta_file # delete the 3rd sequences bioseq -d 're:B31' fasta_file # delete sequences with regex bioseq -t 1 dna_fasta # translate in 1st reading frame bioseq -t 3 dna_fasta # translate in 3 reading frames bioseq -t 6 dna_fasta # translate in 6 reading frames bioseq -g fasta_file # remove gaps
bioseq -s '1,10' fasta_file # sub-sequence from positions 1-10 bioseq --reloop '10' contig_fasta # re-circularize a genome at position 10
bioseq --linearize fasta_file # Linearize FASTA: one sequence per line bioseq --break fasta_file # Break into single-seq files bioseq --count-codons cds_fasta # Codon counts (for coding sequences) bioseq --hydroB pep_fasta # Hydrophobicity score (for protein seq) bioseq --input 'genbank' --feat2fas file.gb # extract genbank features to FASTA # bioseq --restrict 'EcoRI' dna_fasta # Fragments from restriction digest (disabled)
bioseq -p 'id:B31' dna_fasta | bioseq -g | bioseq -t1 # pick a seq, remove gaps & translate bioseq -p 'order:2' dna_fasta | bioseq -r | bioseq -s '10,20' # pick the 2nd seq, rev-com & subseq
bioseq is a command-line utility for common, routine sequence manipulations based on BioPerl modules including Bio::Seq, Bio::SeqIO, Bio::SeqUtils, and Bio::Tools::SeqStats.
By default, bioseq assumes that both the input and the output files are in FASTA format, to facilitate the chaining (by UNIX pipes) of serial bioseq runs.
Methods that are currently not wrappers should ideally be factored into individual Bio::Perl modules, which are better tested and handle exceptions better than stand-alone codes in the Bio::BPWrapper package. As a design principle, command-line scripts here should consist of only wrapper calls.
This options was designed for legacy programs (e.g., PHLIP suites) that takes only 10 character-long sequence IDs.
Replace sequence IDs with serial IDs n characters long. The sequence is prefaced with a leading 'S'. For example using option --anonymize '5' the first ID will be S0001.
'S'
--anonymize '5'
S0001
A sed script file with a .sed suffix that may be used with sed's -f argument. If the filename is -, the sed file is named STDOUT.sed instead. A message containing the sed filename is written to STDERR.
.sed
-f
-
STDOUT.sed
STDERR
Break into individual sequences, writing a FASTA file for each sequence.
Calculate codon bias of a CDS using Shannon information, as deviation from the genome-wide codon usage (in CUTG GCG format). (Reference: <to be added>)
(Temporarily disabled)
Output a CDS with the same AA sequence with each AA replaced by a synonymous codon randomly chosen from a genome codon pool (specified by CUTG GCG file; see test-files/BbB31.cutg). Only the 1st sequence used if multiple sequences are supplied.
For testing the signficcance of codon-bias information, it is necessary to run this repeatedly to generate at least 100 simulated CDSs, e.g., for i in {1..10}; do bioseq --codon-sim test-files/BbB31.cutg test-files/test-bioseq.cds; done
Two non-BioPerl module dependencies: Algorithm::Numerical::Sample; Math::Random.
Interface to Bio::Tools::CodonTable. Methods include: translate a codon to AA & reverse list codons for an AA. Currently only takes a 3-letter DNA-base codon or a 1-letter uppercase IUPAC aa code
--codon-table 'ACG' --codon-table 'L'
Base or AA composition.
Count codons for coding sequences (e.g., a genome file consisting of CDS sequences).
Delete a sequence or a comma-separated list of sequences, e.g.,
--delete id:foo # by id --delete order:2 # by order --delete length:n # by min length, where 'n' is length --delete ambig:x # by max number of x ambiguous bases (non-ATCGs), e.g., if x=20, delete seqs w/ 20 or mroe N's --delete id:foo,bar # list by id --delete re:REGEX # using a regular expression (only one regex is expected) --delete file:name # by file, one id per line
Extract gene sequences in FASTA from a GenBank file of bacterial genome. Won't work for a eukaryote genbank file. For example:
bioseq -i'genbank' -F <genbank_file.gb> (throws error if not genbank file)
Retrieves a sequence from GenBank using the provided accession number, e.g., bioseq -f 'NC_003078' -o 'genbank'
Return the mean Kyte-Doolittle hydropathicity for protein sequences.
Return iso-electric point for a protein sequences as well as charges at a series of pH values. Depends on Bio::Tools::pICalculator.
Input file format. By default, this is 'fasta'. For Genbank format, use 'genbank'. For EMBL format, use 'embl'. For FASTQ, use 'fastq'
[We tried to guess the format using Bio::Tools::GuessSeqFormat, but it didn't work for pipe. Guess format will be delayed until this issue is fixed]
Count and return the number of leading gaps in each sequence.
Print all sequence lengths.
Linearize FASTA, one sequence per line.
Find and return the longest ORF (return the original if no error for -t1, i.e., no internal stop at the 1st reading frame). This is useful for fixing out-of-frame seqs. Turn on -Z (no revcom to search only in the given strand)
Print lower and upper bound of molecular weight
Print number of non-ATCG's for a dna sequence.
Redirect STDERR to a file to view non-ATCG positions, e.g.,: bioseq --num-gaps-dna seq.fas > out 2> log
Print number of non-AA's for a protein sequence.
Redirect STDERR to a file to view non-AA positions, e.g.,: bioseq --num-gaps-aa seq.pep > out 2> log
Remove gaps
Print number of sequences.
Output file format. By default, this is 'fasta'. For Genbank format, use 'genbank'. For EMBL format, use 'embl'.
Select a single sequence:
--pick 'id:foo' by id --pick 'order:2' by order --pick 're:REGEX' using a regular expression
Select a list of sequences:
--pick 'id:foo,bar' list by id --pick 'order:2,3' list by order --pick 'order:2-10' list by range --pick 'file:name' id list in file, one id per line
Re-circularize a bacterial genome by starting at a specified position. For example, for sequence "ABCDE", bioseq -R'2' would generate "BCDEA".
bioseq -R'2'
Append sequence names specified by a file (two tab-separated columns: old_name, new_name), or a single id
Remove stop codons (e.g., for PAML input)
Predicted fragments from digestion by a specified restriction enzyme. Disabled (not part of bioperl dist)
Predicted fragments from digestion by a specified restriction enzyme. Outputs cooridnates of overhangs in BED format. Disabled (not part of bioperl dist)
Reverse complement.
Sort by id, length, or a file with an intended order of seq ids (Contributor: Jeffery Rosario; Fall 2017)
Parse cdhit output .clstr file and generate a FASTA file for each CDHIT family.
Select substring (of the 1st sequence).
Randomly substitute each codon with a synonymous one for a coding sequence
Translate in 1, 3, or 6 frames. e.g., -t1, -t3, or -t6.
Print a brief help message and exit.
Print the manual page and exit.
Print current release version of this command and exit.
Bio::BPWrapper::SeqManipulations, the underlying Perl Module
Qiu Lab wiki page
Github project wiki page
Lawrence et al (2015). FAST: FAST analysis of sequences toolbox. Front. Genet. 6:172. weblink
Yözen Hernández <yzhernand at gmail dot com> (Initial desgin and implementation)
Girish Ramrattan <gramratt at gmail dot com> (developer)
Levy Vargas <levy dot vargas at gmail dot com> (developer)
Weigang Qiu (Maintainer)
Rocky Bernstein (testing and release)
Filipe G. Vieira (developer of --restrict; --restrict-coord methods)
Add bioperl scripts ("bp_xxx.pl") functions?
Hernandez, Bernstein, Qiu, et al (2017). "BpWrappers: Command-line utilities for manipulation of sequences, alignments, and phylogenetic trees based on BioPerl". (In prep).
Stajich et al (2002). "The BioPerl Toolkit: Perl Modules for the Life Sciences". Genome Research 12(10):1611-1618.
To install Bio::BPWrapper, copy and paste the appropriate command in to your terminal.
cpanm
cpanm Bio::BPWrapper
CPAN shell
perl -MCPAN -e shell install Bio::BPWrapper
For more information on module installation, please visit the detailed CPAN module installation guide.