Bio::Tools::Fasta.pm - Bioperl Fasta utility object
This module is included with the central Bioperl distribution:
http://bio.perl.org/Core/Latest ftp://bio.perl.org/pub/DIST
Follow the installation instructions included in the README file.
Bio::Tools::Fasta.pm cannot yet build sequence analysis objects given output from the FASTA program. This module can only be used for parsing Fasta multiple sequence files. This situation may change.
If $file is not a valid filename, data will be read from STDIN. See the parse() method for a complete description of parameters.
    use Bio::Tools::Fasta qw(:obj);

    $seqCount = $Fasta->parse(-file      => $file,
                              -seqs      => \@seqs,
                              -ids       => \@ids,
                              -edit_id   => 1,
                              -edit_seq  => 1,
                              -descs     => \@descs,
                              -filt_func => \&filter_seq,  # filter input sequences.
                              -exec_func => \&process_seq, # process each seq as it is parsed.
                              );
The Bio::Tools::Fasta.pm module, in its present incarnation, encapsulates data and methods for managing Fasta multiple sequence files (reading, parsing). It does not yet work with output from the Fasta sequence analysis program (see "References & Information about the FASTA program").
The documentation of this module is incomplete. For some examples of usage, see the DEMO SCRIPTS section.
Unlike "Blast", the term "Fasta" is ambiguous since it refers to both a sequence file format and a sequence analysis utility (I use "FASTA" to refer to the program; "Fasta" for the file format). Ultimately, this module will be able to work with both Fasta sequence files as well as result files generated by FASTA sequence analysis, analogous to the way the Bio::Tools::Blast.pm object is used for working with Blast output.
WEBSITES:
    ftp://ftp.virginia.edu/pub/fasta/   - FASTA software
    http://www2.ebi.ac.uk/fasta3/       - FASTA server at EBI
PUBLICATIONS: (with PubMed links)
Pearson W.R. and Lipman, D.J. (1988). Improved tools for biological sequence comparison. PNAS 85:2444-2448
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=3162770&form=6&db=m&Dopt=b
Pearson, W.R. (1990). Rapid and sensitive sequence comparison with FASTP and FASTA. Methods in Enzymology 183:63-98.
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=2156132&form=6&db=m&Dopt=b
A simple demo script is included with the central Bioperl distribution (see INSTALLATION) and is also available from:
http://bio.perl.org/Core/Examples/seq/
Bio::Tools::Fasta.pm is a concrete class that inherits from Bio::Tools::SeqAnal.pm. This module also relies on Bio::Seq.pm for producing sequence objects.
User feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments and suggestions preferably to one of the Bioperl mailing lists. Your participation is much appreciated.
    vsns-bcd-perl@lists.uni-bielefeld.de       - General discussion
    vsns-bcd-perl-guts@lists.uni-bielefeld.de  - Technically-oriented discussion
    http://bio.perl.org/MailList.html          - About the mailing lists

Report bugs to the Bioperl bug tracking system to help us keep track of the bugs and their resolution. Bug reports can be submitted via email or the web:

    bioperl-bugs@bio.perl.org
    http://bio.perl.org/bioperl-bugs/
Steve A. Chervitz, sac@genome.stanford.edu
Bio::Tools::Fasta.pm, 0.014
Copyright (c) 1998 Steve A. Chervitz. All Rights Reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
    Bio::Tools::SeqAnal.pm                   - Sequence analysis object base class.
    Bio::Seq.pm                              - Biosequence object
    Bio::Root::Object.pm                     - Proposed base class for all Bioperl objects.
    http://bio.perl.org/Projects/modules.html - Online module documentation
    http://bio.perl.org/                     - Bioperl Project Homepage
"References & Information about the FASTA program".
Incorporate code for parsing Fasta sequence analysis reports.
Improve documentation.
Methods beginning with a leading underscore are considered private and are intended for internal use by this module. They are not considered part of the public interface and are described here for documentation purposes only.
 Usage     : n/a; automatically called by Bio::Root::Object::new()
 Purpose   : Calls superclass constructor.
 Returns   : n/a
 Argument  : Named parameters passed to new() are processed by this method.
           : At present, none are processed.
See Also : Bio::Tools::SeqAnal::_initialize()
 Usage     : $fasta_obj->parse( %named_parameters )
 Purpose   : Parse a set of Fasta sequences or Fasta reports from a file or STDIN.
           : (Currently only Fasta sequence parsing is supported.)
 Returns   : Integer (number of sequences or Fasta reports parsed).
 Argument  : Named parameters: (TAGS CAN BE UPPER OR LOWER CASE)
           :  -FILE       => string (name of file containing Fasta-formatted sequences.
           :                 Optional. If a valid file is not supplied,
           :                 STDIN will be used).
           :  -SEQS       => boolean (true  = parse a Fasta multi-sequence file
           :                          false = parse a Fasta sequence analysis report).
           :  -IDS        => array_ref (optional).
           :  -DESCS      => array_ref (optional).
           :  -EDIT_ID    => boolean (true = edit sequence identifiers).
           :  -EDIT_SEQ   => boolean (true = edit sequence data).
           :  -TYPE       => string (type of sequences to be processed:
           :                 'dna', 'rna', 'amino').
           :  -FILT_FUNC  => func_ref (reference to a function for filtering out
           :                 sequences as they are being parsed.
           :                 This function should return a boolean
           :                 (true if the sequence should be filtered out)
           :                 and accept three arguments as shown
           :                 in this sample filter function:
           :                   sub filt {
           :                       my($len, $id, $desc) = @_;
           :                       # $len is the sequence length
           :                       return ($len < 25 and $id =~ /^123/);
           :                   }
           :                 This function will screen out any sequence
           :                 less than 25 in length and having an id
           :                 starting with '123').
           :  -SAVE_ARRAY => array_ref (reference to an array for storing all
           :                 sequence objects as they are created).
           :  -EXEC_FUNC  => func_ref (reference to a function for processing each
           :                 sequence object as it is parsed.
           :                 When working with sequences, this function
           :                 should accept a Bio::Seq.pm object as its
           :                 sole argument. Return value will be ignored).
           :  -STRICT     => boolean (increases sensitivity to errors).
           :
           : ----------------------------------------------------------------
           : NOTE: Parameters such as seqs, ids, descs, edit_id, edit_seq, type
           :       are used only when parsing Fasta sequence files.
           :       Additional parameters will be added as necessary for
           :       parsing Fasta sequence analysis reports.
           :
           : NOTE: When retrieving sequence data instead of objects,
           :       the -SEQS, -IDS, and -DESCS parameters should all be array refs.
           :       This constitutes a signal that sequence objects are not
           :       to be constructed.
 Throws    : Propagates any exceptions thrown by _parse_seq_stream()
 Comments  : WORKING WITH SEQUENCE DATA:
           : ---------------------------
           : The parse method can return sequence data bundled into Bio::Seq.pm
           : objects or in raw format (separate arrays for seq, id, and desc data).
           : The reason for this is that in some cases, you don't particularly
           : need to work with sequence objects and it is inefficient to build
           : objects just to have them broken apart. However, there is something
           : to be said for choosing one approach -- always return seq objects.
           : In this way, the object becomes the basic unit of exchange.
           : For now, both options are allowed.
           :
           : The story will be different for Fasta sequence analysis report
           : objects since these are a much more complex data type and it would be
           : unwieldy and dangerous to return parsed data unencapsulated from an
           : object.
See Also : _parse_seq_stream(), _set_id_desc(), _get_parse_seq_func()
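The callback contract described for -FILT_FUNC and -EXEC_FUNC can be sketched in plain Perl without any Bioperl dependency. This is an illustration only: parse_fasta_text() and its -text parameter are hypothetical stand-ins, not the module's parse(), and the exec callback here receives a plain hash reference rather than a Bio::Seq.pm object.

```perl
use strict;
use warnings;

# Hypothetical stand-in for parse(): NOT the Bio::Tools::Fasta implementation.
# Splits Fasta text into records, applies an optional filter, and hands each
# surviving record to an optional exec callback. Returns the number kept.
sub parse_fasta_text {
    my (%args) = @_;
    my $filt  = $args{-filt_func};
    my $exec  = $args{-exec_func};
    my $count = 0;

    # Each record starts with '>' at the beginning of a line.
    for my $rec (split /^>/m, $args{-text}) {
        next unless $rec =~ /\S/;
        my ($header, @lines) = split /\n/, $rec;
        my ($id, $desc) = split ' ', $header, 2;
        $desc = '' unless defined $desc;
        my $seq = join '', @lines;
        $seq =~ s/\s+//g;

        # Filter contract from the text: true means "filter this one out".
        next if $filt and $filt->(length($seq), $id, $desc);

        $exec->({ id => $id, desc => $desc, seq => $seq }) if $exec;
        $count++;
    }
    return $count;
}

# The sample filter from the documentation, with its arguments taken from @_.
sub filt {
    my ($len, $id, $desc) = @_;
    return ($len < 25 and $id =~ /^123/);
}

my $text = ">123abc short one\nACGT\n>456def kept\nACGTACGT\n";
my @kept;
my $n = parse_fasta_text(
    -text      => $text,
    -filt_func => \&filt,
    -exec_func => sub { push @kept, $_[0]{id} },
);
print "$n kept: @kept\n";   # prints "1 kept: 456def"
```

The first record is short and its id starts with '123', so the filter returns true and it is skipped; the second record is also short but fails the id test, so it is passed to the exec callback.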
 Usage     : n/a. Internal method called by parse()
 Purpose   : Obtains the function to be used during parsing and calls read().
 Returns   : Integer (the number of sequences read)
 Argument  : Named parameters (forwarded from parse())
 Throws    : Propagates any exception thrown by _get_parse_seq_func() and read().
 Comments  : This method permits the sequence data to be parsed as it is being
           : read in. The motivation here is that when working with a potentially
           : huge set of sequences, there is no need to read them all into memory
           : before you start processing them. In fact, you may only be interested
           : in a few of them.
           :
           : This method constructs and returns a closure for parsing a single
           : Fasta sequence. It is called automatically by the read() method
           : inherited from Bio::Root::Object.pm.
           :
           : Another issue concerns what to do with the parsed data: save it or
           : use it? Sometimes you need to process all sequence data as a group
           : (e.g., sorting). Other times, you can safely process each sequence
           : as it gets parsed and then move on to the next. By delivering each
           : sequence as it gets parsed, the client is free to decide what to do
           : with it.
See Also : _get_parse_seq_func(), Bio::Root::Object::read()
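The idea of delivering each sequence as soon as it has been read, rather than loading the whole file first, can be sketched as follows. stream_fasta() is a hypothetical helper written for this illustration; it does not use read() from Bio::Root::Object.pm and is not how _parse_seq_stream() is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the module's _parse_seq_stream(): reads a Fasta
# stream line by line and calls $on_record once per completed record, so only
# one sequence is held in memory at a time.
sub stream_fasta {
    my ($fh, $on_record) = @_;
    my ($header, $seq, $count) = (undef, '', 0);

    # Closure that delivers the record collected so far, if there is one.
    my $flush = sub {
        return unless defined $header;
        $on_record->($header, $seq);
        $count++;
    };

    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(.*)/) {
            $flush->();                   # deliver the previous record
            ($header, $seq) = ($1, '');
        }
        elsif (defined $header) {
            $seq .= $line;
        }
    }
    $flush->();                           # deliver the final record
    return $count;
}

my $data = ">seq1 first\nACGT\nTTGA\n>seq2 second\nGGCC\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my %length_of;
my $n = stream_fasta($fh, sub {
    my ($header, $seq) = @_;
    my ($id) = split ' ', $header;
    $length_of{$id} = length $seq;        # use the record, then let it go
});
close $fh;
print "$n records; seq1=$length_of{seq1}, seq2=$length_of{seq2}\n";
```

Here the callback keeps only the length of each sequence, so the sequence data itself is released before the next record is read.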
 Usage     : n/a. Internal method called by _parse_seq_stream()
 Purpose   : Generates a function reference to be used for parsing raw sequence data
           : as it is being loaded by read().
           : Used when parsing Fasta sequence files.
 Returns   : Function reference (actually a closure)
 Argument  : Named parameters forwarded from _parse_seq_stream()
 Throws    : Exceptions due to improper argument types.
           : (to be elaborated...)
 Comments  : The function generated performs sequence editing if
           : the -EDIT_SEQ parse() parameter is non-zero.
           : This consists of removing any ambiguous residues at the beginning
           : or end of the sequence.
           : Regardless of -EDIT_SEQ, all sequences will be edited to remove
           : whitespace and non-alphabetic chars.
           : Gap characters are permitted ('.' and '-').
           : (Need a more universal way to identify gap characters.)
           : If sequence objects are generated and an -EXEC_FUNC is supplied,
           : each object will be destroyed after calling this function.
           : This prevents memory usage problems for large runs.
See Also : parse(), _parse_seq_stream(), Bio::Root::Object::_rearrange()
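The editing rules described above can be approximated with two substitutions. This is a sketch under stated assumptions, not the closure the module generates: clean_seq() and its edit_seq/type options are hypothetical names, and the choice of which residues count as "ambiguous" (N for nucleotides, X for amino acids) is an assumption, since the documentation does not list them.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the documented editing rules; the module's
# own closure may differ. ASSUMPTION: "ambiguous" residues are N (nucleotide)
# and X (amino acid).
sub clean_seq {
    my ($raw, %opt) = @_;
    my $seq = $raw;

    # Always applied: drop whitespace and anything that is not a letter
    # or one of the permitted gap characters '.' and '-'.
    $seq =~ s/[^A-Za-z.\-]//g;

    if ($opt{edit_seq}) {
        my $ambig = $opt{type} && $opt{type} eq 'amino' ? 'Xx' : 'Nn';
        $seq =~ s/^[$ambig]+//;   # trim ambiguous residues at the start
        $seq =~ s/[$ambig]+$//;   # trim ambiguous residues at the end
    }
    return $seq;
}

my $plain  = clean_seq("NNAC GT-12AC.NN\n");
my $edited = clean_seq("NNAC GT-12AC.NN\n", edit_seq => 1, type => 'dna');
print "$plain\n";    # prints "NNACGT-AC.NN"
print "$edited\n";   # prints "ACGT-AC."
```

Interior ambiguous residues are left alone in this sketch; only leading and trailing runs are trimmed, matching the "at the beginning or end" wording.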
 Usage     : $fasta_obj->edit_id()
 Purpose   : Set/Get a boolean indicator as to whether sequence IDs should be edited.
           : Used when parsing Fasta sequence files.
 Returns   : Boolean (true if the IDs are to be edited).
 Argument  : Boolean (optional)
 Throws    : n/a
See Also : _set_id_desc(), _get_parse_seq_func()
 Usage     : $fasta_obj->edit_seqs()
 Purpose   : Set/Get a boolean indicator as to whether sequences should be edited.
           : Used when parsing Fasta sequence files.
 Returns   : Boolean (true if the sequences are to be edited).
 Argument  : Boolean (optional)
 Throws    : n/a
See Also : _get_parse_seq_func()
 Usage     : n/a. Internal method called by _get_parse_seq_func()
 Purpose   : Sets the _id and _desc data members, optionally editing the id.
           : Used when parsing Fasta sequence files.
 Returns   : 2-element list containing: ($id, $description)
 Argument  : String containing raw ID + description (leading '>' will be stripped)
 Throws    : n/a
 Comments  : Optionally edits the ID if the '_edit_id' field is true.
           : Descriptions are not altered.
           : ID Edits:
           :   1) Uppercases the ID.
           :   2) If the ID has any | characters the following is performed:
           :      a) Replace | characters with _ characters
           :         (prevent regexp and shell trouble).
           :      b) Cleans up complex identifiers.
           :         Some GenBank specifiers have multiple parts:
           :           >gi|2980872|gnl|PID|e1283615 homeobox protein SHOTb
           :         Only the first ID is saved as the official ID.
           :         Extra ids will be included at the end of the
           :         description between brackets:
           :           GI_2980872 homeobox protein SHOTb [ GNL PID e1283615 ]
           :
           : ID editing is somewhat experimental.
See Also : _get_parse_seq_func(), edit_id()
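A rough re-creation of the ID-editing rules, for illustration only. edit_id_desc() is a hypothetical function, not the module's _set_id_desc(). It assumes that "the first ID" means the first two |-separated fields (database tag plus accession). Because it uppercases the whole identifier before splitting (rule 1), the extra parts come out fully uppercased, whereas the documented example shows the last part in its original case; which of the two the module actually produces is not confirmed here.

```perl
use strict;
use warnings;

# Hypothetical re-implementation of the documented ID-editing rules; it is
# NOT the module's _set_id_desc() and may differ from it in detail.
# ASSUMPTION: the "first ID" is the first two '|'-separated fields.
sub edit_id_desc {
    my ($header) = @_;
    $header =~ s/^>\s*//;                      # strip the leading '>'
    my ($id, $desc) = split ' ', $header, 2;
    $desc = '' unless defined $desc;

    $id = uc $id;                              # rule 1: uppercase the ID
    if ($id =~ /\|/) {                         # rule 2: compound identifier
        my @parts = grep { length } split /\|/, $id;
        my @first = splice @parts, 0, 2;       # e.g. GI + 2980872
        $id = join '_', @first;                # '|' becomes '_'
        $desc .= ' [ ' . join(' ', @parts) . ' ]' if @parts;
    }
    return ($id, $desc);
}

my ($id, $desc) =
    edit_id_desc('>gi|2980872|gnl|PID|e1283615 homeobox protein SHOTb');
print "$id $desc\n";
# prints "GI_2980872 homeobox protein SHOTb [ GNL PID E1283615 ]"
```

An identifier without any | characters is simply uppercased and its description is returned unchanged.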
 Usage     : $fasta_obj->num_seqs()
 Purpose   : Get the number of sequences read by the Fasta object.
 Returns   : Integer
 Argument  : n/a
 Throws    : n/a
Information about the various data members of this module is provided for those wishing to modify or understand the code. Two things to bear in mind:
All data members are prefixed with an underscore to signify that they are private. Always use accessor methods. If the accessor doesn't exist or is inadequate, create or modify an accessor (and let me know, too!).
It is easy for these data member descriptions to become obsolete as this module is still evolving. Always double check this info and search for members not described here.
An instance of Bio::Tools::Fasta.pm is a blessed reference to a hash containing all or some of the following fields:
 FIELD          VALUE
 --------------------------------------------------------------
 _seqCount      Number of sequences parsed.
 _edit_seq      Boolean. Should sequences be edited during parsing?
 _edit_id       Boolean. Should ids be edited during parsing?

More data members will be added when code for Fasta report processing is incorporated.

INHERITED DATA MEMBERS
(See Bio::Tools::SeqAnal.pm for inherited data members.)
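The "blessed hash with private fields, reached only through accessors" arrangement described above looks roughly like this in plain Perl. My::FastaSketch is a hypothetical class made up for this illustration; it does not inherit from Bio::Tools::SeqAnal.pm and only mimics the three fields listed in the table.

```perl
use strict;
use warnings;

package My::FastaSketch;   # hypothetical class, NOT Bio::Tools::Fasta

sub new {
    my ($class) = @_;
    # Underscore-prefixed keys mark the fields as private by convention.
    my $self = { _seqCount => 0, _edit_seq => 0, _edit_id => 0 };
    return bless $self, $class;
}

# Combined set/get accessor: stores a normalized boolean when given an
# argument, and always returns the current value.
sub edit_id {
    my $self = shift;
    $self->{_edit_id} = ($_[0] ? 1 : 0) if @_;
    return $self->{_edit_id};
}

# Read-only accessor for the sequence counter.
sub num_seqs {
    my $self = shift;
    return $self->{_seqCount};
}

package main;

my $obj = My::FastaSketch->new;
my $before = $obj->edit_id;      # 0: nothing set yet
$obj->edit_id('yes');            # any true value is stored as 1
print $before, " ", $obj->edit_id, " ", $obj->num_seqs, "\n";   # prints "0 1 0"
```

Client code calls $obj->edit_id rather than touching $obj->{_edit_id}, so the field layout can change without breaking callers.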