Bio::Seq - bioperl sequence object
    $seq = Bio::Seq->new;
    $seq = Bio::Seq->new(-seq=>'ACTGTGGCGTCAACTG');
    $seq = Bio::Seq->new(-seq=>$sequence_string);
    $seq = Bio::Seq->new(-seq=>@character_list);
    $seq = Bio::Seq->new($file,$seq,$id,$desc,$names,
                         $numbering,$type,$ffmt,$descffmt);
There are two ways to create Bio::Seq objects from files. One is using internal Sequence reading routines in this object, which can handle a few formats. The second is to use the newer SeqIO system, which can handle slightly more formats, can handle multiple sequences in one file, and can be easily extended to new formats.
Try to use the new style. It does give you more flexibility and stability.
    # old style, deprecated
    $seq = Bio::Seq->new($filename);   # guesses Fasta format
    $seq = Bio::Seq->new(-file=>'seqfile.aa',
                         -desc=>'Sample Bio::Seq sequence',
                         -start=>'1',
                         -ffmt=>'Fasta',
                         -type=>'Amino',
                         );

    # new style: better, but somewhat more wordy
    # notice that this loops over multiple sequences
    use Bio::SeqIO;
    $stream = Bio::SeqIO->new(-file => 'myfile', -format => 'Fasta');
    while ( $seq = $stream->next_seq() ) {
        # $seq is a Bio::Seq object
    }
    $seq->[METHOD];
    $result = $seq->[METHOD];

Accessors
--------------------------------------------------------
There are a wide variety of methods designed to give easy and flexible access to the contents of sequence objects. The following accessors can be invoked upon a sequence object:

    ary()       - access sequence (or slice of sequence) as an array
    str()       - access sequence (or slice of sequence) as a string
    getseq()    - access sequence (or slice) as string or array
    seq_len()   - access sequence length
    id()        - access/change object id
    desc()      - access/change object description
    names()     - access/change object names
    start()     - access/change start point of the sequence (see note below)
    end()       - access/change end point of the sequence (see note below)
    numbering() - access/change sequence numbering offset (deprecated)
    origin()    - access/change sequence origin
    type()      - access/change sequence type
    setseq()    - change sequence

Deprecated format changes:

    ffmt()      - access/change default output format
    descffmt()  - access/change description format

Methods
--------------------------------------------------------
The following methods can be invoked upon a sequence object:

    copy()        - returns an exact copy of an object
    alphabet_ok() - check sequence against genetic alphabet
    alphabet()    - returns the genetic alphabet currently in use
    layout()      - sequence formatter for output
    revcom()      - reverse complement of sequence
    complement()  - complement of sequence
    reverse()     - reverse of sequence
    Dna_to_Rna()  - translate Dna seq to Rna
    Rna_to_Dna()  - translate Rna seq to Dna
    translate()   - protein translation of Dna/Rna sequence

copy, revcom and translate all return new Bio::Seq objects. This makes it easy to use these objects in other Bioperl modules and/or use the new SeqIO system for format dumping.

complement, reverse, Dna_to_Rna and Rna_to_Dna all return strings, as it is less likely that you want these things as real Seq objects.
The Bio::Seq object is by far the oldest object in the bioperl set of modules, and it shows: four or five people have developed methods for it, and much of the documentation is focused on general bioperl issues. The bioperl core group have a commitment to eventually rewrite the Bio::Seq object with some more sensible design principles, but this rewrite will

    a) be heavily tested against old uses of the code
    b) aim to be as backwardly compatible as possible
    c) be well signposted that it is occurring.
For more information read the bioperl web page, projects, sequence object,
http://bio.perl.org/Projects/Sequence/
This module is included with the central Bioperl distribution:
http://bio.perl.org/Core/Latest ftp://bio.perl.org/pub/DIST
Follow the installation instructions included in the README file.
This module is the generic sequence object which lies at the core of the bioperl project. It stores Dna, Rna, or Protein sequence information and annotation. It has associated methods to perform various manipulations of sequences, and support for reading and writing sequence data in a variety of file formats.
Bio::Seq has completely superseded Bio::PreSeq.pm.
The older PreSeq.pm code can be found at Chris Dagdigian's site: http://www.sonsorol.org/dag/bioperl/top.html
Currently the following sequence types are recognized:
Dna Rna Amino
This module uses the standard extended single-letter genetic alphabets to represent nucleotide and amino acid sequences.
In addition to the standard alphabet, the following symbols are also acceptable in a biosequence:
    ?  (a missing nucleotide or amino acid)
    -  (gap in sequence)
(includes symbols for nucleotide ambiguity)

    ------------------------------------------
    Symbol   Meaning             Nucleic Acid
    ------------------------------------------
     A       A                   Adenine
     C       C                   Cytosine
     G       G                   Guanine
     T       T                   Thymine
     U       U                   Uracil
     M       A or C
     R       A or G
     W       A or T
     S       C or G
     Y       C or T
     K       G or T
     V       A or C or G
     H       A or C or T
     D       A or G or T
     B       C or G or T
     X       G or A or T or C
     N       G or A or T or C

    IUPAC-IUB SYMBOLS FOR NUCLEOTIDE NOMENCLATURE:
      Cornish-Bowden (1985) Nucl. Acids Res. 13: 3021-3030.
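The ambiguity table above can be put to work directly. The following is a minimal sketch, not part of Bio::Seq: the hash %IUPAC and the helper iupac_to_regex() are hypothetical names introduced here to show how a pattern written with ambiguity symbols could be turned into a Perl regular expression.

```perl
use strict;
use warnings;

# Nucleotide symbols and the bases they stand for, copied from the table above.
my %IUPAC = (
    A => 'A',   C => 'C',   G => 'G',   T => 'T',   U => 'U',
    M => 'AC',  R => 'AG',  W => 'AT',  S => 'CG',  Y => 'CT',  K => 'GT',
    V => 'ACG', H => 'ACT', D => 'AGT', B => 'CGT',
    X => 'ACGT', N => 'ACGT',
);

# Hypothetical helper: build a regex fragment from an ambiguous pattern.
# Unambiguous symbols are copied; ambiguous ones become character classes.
sub iupac_to_regex {
    my ($pattern) = @_;
    my $re = '';
    for my $sym ( split //, uc($pattern) ) {
        die "unknown nucleotide symbol '$sym'\n" unless exists $IUPAC{$sym};
        my $set = $IUPAC{$sym};
        $re .= length($set) > 1 ? "[$set]" : $set;
    }
    return $re;
}

my $site = iupac_to_regex('GAWTNCA');
print "pattern matches\n" if 'GATTACA' =~ /\A$site\z/;
```

The sketch ignores the "?" and "-" symbols, which would need their own handling.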
    ------------------------------------------
    Symbol   Meaning
    ------------------------------------------
     A       Alanine
     B       Aspartic Acid, Asparagine
     C       Cysteine
     D       Aspartic Acid
     E       Glutamic Acid
     F       Phenylalanine
     G       Glycine
     H       Histidine
     I       Isoleucine
     K       Lysine
     L       Leucine
     M       Methionine
     N       Asparagine
     P       Proline
     Q       Glutamine
     R       Arginine
     S       Serine
     T       Threonine
     V       Valine
     W       Tryptophan
     X       Unknown
     Y       Tyrosine
     Z       Glutamic Acid, Glutamine
     *       Terminator

    IUPAC-IUB AMINO ACID SYMBOLS:
      Biochem J. 1984 Apr 15; 219(2): 345-373
      Eur J Biochem. 1993 Apr 1; 213(1): 2
You are encouraged to use the SeqIO system of IO, which in essence looks like:
    use Bio::SeqIO;

    $instream  = Bio::SeqIO->new( -file => 'my.file', -format => 'Fasta' );
    $outstream = Bio::SeqIO->new( -fh => \*STDOUT, -format => 'Raw' );

    while ( $seq = $instream->next_seq ) {
        $outstream->write_seq($seq);
    }

The available formats can be found by listing the SeqIO directory in the distribution that this comes with. New SeqIO formats are very easy to add, so it is better to go to the directory than to try to list them here.
Notice that the SeqIO system will only convert information which the Seq object stores. The Seq object is a lightweight object, and does not contain annotation or feature table information. This information is stored in a development object, called AnnSeq, which will be available in the 0.06 releases and later.
Seq.pm is invoked via the perl 'use' command
use Bio::Seq;
The "constructor" method in Bio::Seq.pm is the new() function.
The proper syntax for accessing the new() function in Seq.pm is as follows:
$myseq = Bio::Seq->new;
Of course, objects are only useful if they have something in them, so you would probably want to pass along some additional information or arguments to the constructor. The foundation of any biosequence object is of course the sequence itself.
You can address new() with a sequence directly:
$myseq = Bio::Seq->new(-seq=>'AACTGGCGTTCGTG');
Or you can pass in a string or a list:
    $myseq = Bio::Seq->new(-seq=>$sequence_string);
    $myseq = Bio::Seq->new(-seq=>@sequence_list);
It is also possible to create a new sequence object based on a sequence contained in a file. You can tell the constructor where to find the sequence file by passing in the 'file' parameter:
$myseq = Bio::Seq->new(-file=>'seqfile.gcg');
Because there are so many different conventions or formats for storing sequence information in files, it would be polite (although not absolutely necessary) to tell the constructor what format the sequence file is in. We can provide that information via the file-format or 'ffmt' field. To create a sequence object based upon a GCG-formatted sequence file:
$myseq = Bio::Seq->new(-file=>'seqfile.gcg',-ffmt=>'GCG');
We've already introduced three different object attributes or arguments that can be passed to the new() object constructor ('seq','file' and 'ffmt') so now would be a good time to introduce them all:
BioSeq Constructor Arguments
file: The "file" argument should be a string value containing path and filename information for a sequence file that is to be read into an object.
seq: The "seq" argument is for passing in sequence directly instead of reading in a sequence file. The sequence should consist of RAW info (no whitespace, newlines or formatting) and can be passed in as either an array/list or string.
id: The "id" argument should be a ONE-WORD string value giving a short name for the sequence.
desc: The "desc" argument should be a string containing a description of the sequence. This field is not limited to one word.
names: The "names" argument should be a hash or reference to a hash that contains any number of user-generated key-value pairs. Various bits of identifying information can be stored here, including name(s), database locations, accession numbers, URLs, etc.
type: The "type" argument should be a string value describing the sequence type, e.g. "Dna", "Rna" or "Amino".
origin: The "origin" argument should be a string value describing sequence origin info
start: The start point, in biological coordinates of the sequence
end: The end point, in biological coordinates of the last residue in the sequence
start/end attributes are not strongly tied to what is actually in the sequence (i.e., $seq->start()+length($seq->getseq()) doesn't necessarily equal $seq->end()+1, although most of the time it should).
This is to allow some oddities to be stored in the Seq object sensibly.
The numbering convention is 'biological' coordinates, i.e. the sequence ATG would start at 1 (A) and finish at 3 (G). (NB - this is different from how perl represents ranges in sequences, which count from 0.)
numbering() is equivalent to start() (old version). Eventually it will be removed. numbering() accesses the same attribute as start()
numbering: (Deprecated) The "numbering" argument should be an integer value containing the sequence numbering offset value. By default all sequences are numbered starting with 1.
ffmt:
This documentation describes the old format system: you are encouraged to use the newer SeqIO system described separately in the SeqIO documentation.
The "ffmt" argument should be a string describing sequence file-format. If a sequence is being read from a file via the "file" argument, "ffmt" is used to invoke the proper parsing code. "ffmt" is also the default format for sequence output when the layout method is called. See elsewhere in this documentation for info regarding recognized sequence file-formats.
If most of these arguments were used at once to create a sequence object, it would look something like this:
    # Set up the name hash
    %names = (
        'CloneID',  'DB1',
        'Isolate',  '5',
        'Tissue',   'Xenopus',
        'Location', '/usr2/users/dag/bioperl/sample.tfa'
    );
    $name_ref = \%names;

    # Create the object
    $myseq = Bio::Seq->new(-file=>'sample.tfa',
                           -names=>$name_ref,
                           -type=>'Dna',
                           -origin=>'Xenopus mesoderm',
                           -start=>'1',
                           -desc=>'Sample Bio::Seq sequence',
                           -ffmt=>'Fasta');
Once an object has been created, there are defined ways to go about accessing the information. Users are encouraged to poke around "under the hood" of Seq.pm to see what is going on, but it is considered bad form to bypass the defined accessor methods and mess around with the internal code. Bypassing the defined methods "voids the warranty" of the module and can lead to problems down the road. The implied agreement between module creators and users is that the creators will strive to keep the interface standard and backwards-compatible, while the users will avoid becoming dependent on bits of internal code that may change or disappear in future revisions.
Detailed information about each method described here can be found in the Appendix.
For each defined way to access information from a biosequence object, there is a corresponding "method" that is invoked. What follows is a brief description of each accessor method. For more detailed information see the individual annotations for each method near the end of this document.
Sequence
The sequence can be accessed in several ways via the getseq() method. Depending on how it is invoked, it can return either a string or a list value.
Both examples are appropriate:
    @sequence_list   = $myseq->getseq;
    $sequence_string = $myseq->getseq;
Sequence "slices" can be accessed by passing start and stop integer position arguments to getseq():
    @slice = $myseq->getseq($start,$stop);
    @slice = $myseq->getseq(1,50);
    @slice = $myseq->getseq(100);
If no stop value is passed in, getseq() will return a slice from the start position to the end of the sequence. Slices are returned in the context of the object "start" attribute, not absolute position, so be aware of the object's numbering scheme.
Sequences can also be accessed with the ary() and str() methods. The ary() method will always return a list value and str() will always return a string. Otherwise they are functionally identical to the getseq() method.
    $sequence = $myseq->str;
    @sequence = $myseq->ary;
    @slice    = $myseq->ary($start,$stop);
    $slice    = $myseq->str($start,$stop);
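To make the coordinate handling concrete, here is a minimal sketch in plain Perl of how a slice given in biological coordinates relates to Perl's zero-based substr(). The helper bio_slice() is a hypothetical name used only for illustration, and the sketch assumes the end position follows from the start position and the sequence length, which the real object does not strictly enforce.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::Seq): return the residues between
# biological positions $from and $to, inclusive, for a sequence whose first
# residue is numbered $start.
sub bio_slice {
    my ($seq, $start, $from, $to) = @_;
    my $end = $start + length($seq) - 1;   # biological position of last residue
    $to = $end unless defined $to;         # open-ended slice runs to the end
    die "slice $from-$to is outside $start-$end\n"
        if $from < $start || $to > $end || $from > $to;
    # substr() counts from 0, so shift every position by the start value.
    return substr($seq, $from - $start, $to - $from + 1);
}

print bio_slice('ATG', 1, 1, 3), "\n";              # positions 1..3
print bio_slice('ACGTACGT', 101, 103, 105), "\n";   # numbering starts at 101
```

With a start value of 101, position 103 is the third residue of the string, which is why the same slice arguments can return different residues for objects with different numbering.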
Sequence length
The sequence length can be accessed using the seq_len() method
$len = $myseq->seq_len;
Sequence ID
The ID field can be accessed using the id() method
$ID = $myseq->id;
Description
The object description field can be accessed using the desc() method
$description = $myseq->desc;
Names
The associative array (hash) that contains flexible information regarding alternative sequence names, database locations, accession numbers, etc. can be accessed by
%name_hash = $myseq->names;
Sequence start
The biological position of the first residue in the sequence can be accessed via start()
$start = $myseq->start;
Sequence end
The biological position of the last residue in the sequence can be accessed via end()
$end = $myseq->end;
Sequence Origin
The object origin (source organism) field can be accessed via origin()
$seq_origin = $myseq->origin;
File input format / default output format
The object format field can be accessed using the ffmt() method
$format = $myseq->ffmt;
In the previous section it was shown how object attributes and values could be retrieved from a sequence object by calling upon various methods. Many of the above methods will also allow the user to CHANGE object attributes by passing in additional arguments. Detailed information on each method can be found in the Appendix.
Changing the sequence
The sequence information for an object can be changed by passing a string or list value to the setseq() method. Here are some ways that sequence information can be changed
    $myseq->setseq($new_sequence_string);
    $myseq->setseq(@new_sequence_list);
    $myseq->setseq("aaccttgcctgc");
The setseq() method checks sequence elements and warns if it finds non-standard characters. Because of this, arbitrary sequence compositions are not supported at this time. This method is considered slightly 'insecure' because the 'id','desc' and 'type' fields are not updated along with the sequence. If necessary, the user must make the appropriate changes to these fields whenever sequence information is updated or changed.
Changing the sequence ID
The ID field can be changed by passing in a new ID argument to id()
$myseq->id($new_id);
Changing the object description
The object description field can be changed by passing in a new argument to desc()
$myseq->desc($new_desc);
Changing the object names hash
The associative array (hash) that contains flexible information regarding alternative sequence names, database locations, accession numbers, etc. can be changed by passing in a reference to a new hash to names()
    $hash_ref = \%name_hash;
    $myseq->names($hash_ref);
Changing the sequence start or end
The default numbering offset for the sequence can be changed by passing in a new value to start() or end()
    $myseq->start(1);
    $myseq->start($new_value);
The object origin field can be changed by passing in a new string value to origin()
    $myseq->origin("mitochondrial");
    $myseq->origin($origin_string);
The object format field can be changed by passing in a new value to ffmt()
$myseq->ffmt("GCG");
Creating, accessing and changing biosequence objects and fields is all well and good, but eventually you are going to want to actually do some work.
Included with Seq.pm are some commonly used utility methods for manipulating sequence data. So far Seq.pm contains methods for:
Copying a biosequence object
using copy()
    # NB - new_obj is a Bio::Seq object
    $new_obj = $myseq->copy;
Reversing a sequence
using reverse()
$reversed_seq = $myseq->reverse;
Complementing a sequence
The 2nd strand, or "complement" of a biosequence can be obtained by calling upon the complement() method.
$comp_seq = $myseq->complement;
Reverse complementing a sequence
using revcom()
    # NB - rev_comp is a Bio::Seq object
    $rev_comp = $myseq->revcom;
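As an illustration of what complementing and reverse complementing involve, the sketch below applies the IUB/GCG complement table documented for revcom() to plain strings with tr///. The functions complement_str() and revcom_str() are hypothetical stand-ins, not the module's own code, and they work on strings rather than Bio::Seq objects.

```perl
use strict;
use warnings;

# Hypothetical string-level stand-in for complement(): each symbol is
# replaced by its complement from the IUB/GCG table (A<->T, C<->G, U->A,
# M<->K, R<->Y, V<->B, H<->D; W, S, X and N are their own complements).
sub complement_str {
    my ($seq) = @_;
    (my $comp = $seq) =~
        tr/ACGTUMRWSYKVHDBXNacgtumrwsykvhdbxn/TGCAAKYWSRMBDHVXNtgcaakywsrmbdhvxn/;
    return $comp;
}

# Hypothetical stand-in for revcom(): complement, then reverse the order.
sub revcom_str {
    my ($seq) = @_;
    return scalar reverse complement_str($seq);
}

print complement_str('AACTG'), "\n";
print revcom_str('AACTG'), "\n";
```

Characters outside the table, such as "-" and "?", pass through tr/// unchanged.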
Translating Dna to Rna
using Dna_to_Rna()
$rna_seq = $myseq->Dna_to_Rna;
Translating Rna to Dna
using Rna_to_Dna()
$dna_seq = $myseq->Rna_to_Dna;
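The Dna/Rna conversions amount to swapping T and U, as the Appendix entry for Dna_to_Rna notes. A minimal string-level sketch follows; dna_to_rna() and rna_to_dna() are hypothetical helper names, and preserving lowercase input is an assumption of this sketch rather than documented behaviour.

```perl
use strict;
use warnings;

# Hypothetical string-level equivalents of Dna_to_Rna() and Rna_to_Dna().
sub dna_to_rna {
    my ($seq) = @_;
    (my $rna = $seq) =~ tr/Tt/Uu/;   # replace T with U, keeping case
    return $rna;
}

sub rna_to_dna {
    my ($seq) = @_;
    (my $dna = $seq) =~ tr/Uu/Tt/;   # replace U with T, keeping case
    return $dna;
}

print dna_to_rna('ATGTTT'), "\n";
print rna_to_dna('AUGUUU'), "\n";
```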
Translating Dna or Rna to protein
using translate()
    # NB - peptide_seq is a Bio::Seq object
    $peptide_seq = $myseq->translate;
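For readers unfamiliar with what a protein translation does, the following sketch translates a nucleotide string codon by codon using the standard genetic code. It is an illustration under stated assumptions, not the module's implementation: translate_str() is a hypothetical name, reading starts at the first base, stop codons are written as "*", and any codon containing an ambiguity symbol is written as "X".

```perl
use strict;
use warnings;

# Standard genetic code, with codons enumerated in T, C, A, G order.
my @bases  = qw(T C A G);
my $aminos = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
my $index = 0;
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            $codon_table{"$first$second$third"} = substr($aminos, $index++, 1);
        }
    }
}

# Hypothetical helper: translate Dna or Rna, ignoring a trailing partial codon.
sub translate_str {
    my ($seq) = @_;
    (my $dna = uc($seq)) =~ tr/U/T/;   # treat Rna input as Dna
    my $protein = '';
    for ( my $pos = 0; $pos + 3 <= length($dna); $pos += 3 ) {
        my $codon = substr($dna, $pos, 3);
        $protein .= exists $codon_table{$codon} ? $codon_table{$codon} : 'X';
    }
    return $protein;
}

print translate_str('ATGGCCTAA'), "\n";
```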
Checking the sequence alphabet
To check if any nonstandard characters are present in a biosequence, an alphabet_ok() method is provided. The method returns "1" if everything is OK, otherwise it returns a "0".
    if ($myseq->alphabet_ok) {
        print "OK!!\n";
    }
    else {
        print "Not OK! \n";
    }
To get the alphabet itself, use the alphabet() method, which will return a string containing all characters in the current alphabet.
$alph = $myseq->alphabet;
To use restrictive alphabets that do not permit ambiguity codes, include '-strict => 1' in the parameters sent to new(). Or, for any existing sequence object, try:
    $myseq->strict(1);
    $myseq->alphabet_ok() or die "alphabet not okay.\n";
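The difference between a restrictive and an ambiguity-aware alphabet can be sketched with a character-class check. Everything named here is an assumption made for illustration: the two alphabet strings are taken from the nucleotide table in this document, in_alphabet() is a hypothetical stand-in for alphabet_ok(), and whether the module's strict mode still accepts "?" and "-" is not stated, so the sketch simply allows them in both cases.

```perl
use strict;
use warnings;

# Assumed alphabets for a Dna sequence (illustrative only).
my $STRICT_DNA = 'ACGT';
my $IUPAC_DNA  = 'ACGTUMRWSYKVHDBXN';

# Hypothetical stand-in for alphabet_ok(): returns 1 if every character of
# $seq is in $alphabet or is "?" (missing) or "-" (gap), otherwise 0.
sub in_alphabet {
    my ($seq, $alphabet) = @_;
    my $allowed = quotemeta( $alphabet . '?-' );
    return uc($seq) =~ /^[$allowed]*$/ ? 1 : 0;
}

for my $seq ( 'ACGT-ACGT', 'ACGTNRY' ) {
    printf "%-10s strict=%d ambiguous=%d\n",
        $seq, in_alphabet($seq, $STRICT_DNA), in_alphabet($seq, $IUPAC_DNA);
}
```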
User feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments and suggestions preferably to one of the Bioperl mailing lists. Your participation is much appreciated.
    vsns-bcd-perl@lists.uni-bielefeld.de       - General discussion
    vsns-bcd-perl-guts@lists.uni-bielefeld.de  - Technically-oriented discussion
    http://bio.perl.org/MailList.html          - About the mailing lists
Report bugs to the Bioperl bug tracking system to help us keep track of the bugs and their resolution. Bug reports can be submitted via email or the web:
bioperl-bugs@bio.perl.org http://bio.perl.org/bioperl-bugs/
Some pieces of the code were contributed by Steven E. Brenner, Steve Chervitz, Ewan Birney, Tim Dudgeon, David Curiel, and other Bioperlers. Thanks !!!!
BioPerl Project Page http://bio.perl.org/
Bio::Seq.pm, beta 0.051
Copyright (c) 1996-1998 Chris Dagdigian, Georg Fuellen, Richard Resnick, and others All Rights Reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The following documentation describes the various functions contained in this module. Some functions are for internal use and are not meant to be called by the user; they are preceded by an underscore ("_").
 Title    : new
 Usage    : $mySeq = Bio::Seq->new($file,$seq,$id,$desc,$names,
                                   $start,$end,$type,$ffmt,$descffmt);
          : - or -
          : $mySeq = Bio::Seq->new(-file=>$file,
                                   -seq=>$seq,
                                   -id=>$id,
                                   -desc=>$desc,
                                   -names=>$names,
                                   -start=>$start,
                                   -end=>$end,
                                   -type=>$type,
                                   -origin=>$origin,
                                   -ffmt=>$ffmt,
                                   -descffmt=>$descffmt);
 Function : The constructor for this class, returns a new object.
 Example  : See usage
 Returns  : Bio::Seq object
 Argument : $file: file from which the sequence data can be read; all
                   the other arguments will overwrite the data read in.
                   "_nofile" is recommended if no file is given.
            $seq: String or array of characters
            $id: String describing the ID the user wishes to assign.
            $desc: String giving a description of the sequence
            $names: A reference to a hash which stores {loc,name}
                    pairs of other database locations and corresponding
                    names where the sequence is located.
            $start: The offset of the sequence, as an integer
            $end: The end point of the sequence, as an integer
            $type: The type of the sequence, see type()
            $origin: The sequence origin
            $ffmt: Sequence format, see ffmt()
            $descffmt: format of $desc, see descffmt()
Title : _initialize Usage : n/a (internal function) Function : Assigns initial parameters to a blessed object. Example : Returns : Argument : As Bio::Seq->new, allows for named or listed parameters. See ->new for the legal types of these values.
Title : _seq() Usage : n/a, internal function Function : called by new() to set sequence field. Checks : alphabet before setting. : Returns : n/a Argument : sequence string
Title : _monomer() Usage : n/a, internal function Function : Returns the internal monomer that represents : sequence type. : : Sequence type is treated internally as a monomer : defined by the %SeqAlph hash. The type field : is a list of format [monomer,origin]. For any : output outside the module, the monomer is resolved : back into string form via the %TypeSeq hash. : Returns : original type setting [as monomer] Argument : none
Title : _file_read() Usage : n/a (Internal Function) Function : _file_read is called whenever the constructor is called : with the name of a sequence to be read from disk. : : This function is now DEPRECATED. you should use the SeqIO : system : Example : n/a, only called upon by _initialize() Returns : Argument :
Title : seq_len() Usage : $len = $myseq->seq_len; Function : Returns a value representing the sequence : length : Example : see above Arguments : none Returns : integer
Title : ary Usage : ary([$start,[$end]]) Function : Returns the sequence of the object as an array, or a substring of the sequence if $start/$end are defined. If $start is defined and $end isn't, the substring is from $start to the end of the sequence. Example : @slice = $myObject->ary(3,9); Returns : array of characters Argument : $start,$end (both integers). They are interpreted w.r.t. the specific numeration of the sequence!! ($self->{start})
Title : str Usage : str([$start,[$end]]) Function : Returns the sequence of the object as a string, or a slice of the sequence if $start/$end are defined. If $start is defined and $end isn't, the slice is from $start to the end of the sequence. Example : $slice = $myObject->str(3,9); Returns : string scalar Argument : $start,$end (both integers). They are interpreted w.r.t. the specific numeration of the sequence!! ($self->{start})
 Title    : seq
 Usage    : seq([$start,[$end]])
 Function : Returns the sequence of the object as an array or a char
            string, depending on the value of wantarray.
            Will return a slice of the sequence if $start/$end are defined.
            If $start is defined and $end isn't, the slice is from $start
            to the end of the sequence.
 Example  : @slice = $myObject->seq(3,9);
 Returns  : regular array of characters, or a scalar string
 Argument : $start,$end (both integers). They are interpreted w.r.t. the
            specific numeration of the sequence!! ($self->{start})
 Comments :

 Title    : getseq
 Usage    : getseq([$start,[$end]])
 Function : Returns the sequence of the object as an array or a char
            string, depending on the value of wantarray.
            Will return a slice of the sequence if $start/$end are defined.
            If $start is defined and $end isn't, the slice is from $start
            to the end of the sequence.
 Example  : @slice = $myObject->getseq(3,9);
 Returns  : regular array of characters, or a scalar string
 Throws   : Warning about deprecated method.
 Argument : $start,$end (both integers). They are interpreted w.r.t. the
            specific numeration of the sequence!! ($self->{start})
 Title    : id()
 Usage    : $seq_id = $myseq->id;
          : $myseq->id($id_string);
 Function : Sets field if an ID argument string is
          : passed in. If no arguments, returns ID value for
          : object.
 Returns  : original ID value
 Argument : ID string

 Title    : desc()
 Usage    : $description = $myseq->desc;
          : $myseq->desc($desc_string);
 Function : Sets field if an argument string is
          : passed in. If no arguments, returns original value for
          : object description field.
 Returns  : original value for description
 Argument : description string

 Title    : names()
 Usage    : %names = $myseq->names;
          : $myseq->names($hash_ref);
 Function : Sets field if a name hash reference is
          : passed in. If no arguments, returns original
          : names hash.
 Returns  : hash reference (associative array)
 Argument : reference to a hash (associative array)
Title : numbering() Usage : $num_start = $myseq->start; : $myseq->start($value); : Function : Sets field if an argument is : passed in. If no arguments, returns original value. : : (Deprecated - should switch to start()) Returns : original value Argument : new value
Title : start Usage : $start = $myseq->start(); #get : $myseq->start($value); #set Function : the set/get for the start position Example : Returns : start value Arguments : new value
Title : end Usage : $end = $myseq->end(); #get : $myseq->end($value); #set Function : The set/get for the end position Example : Returns : end value Arguments : new value
 Title    : get_nse
 Usage    : $tag = $myseq->get_nse()
 Function : gets a string like "name/start-end". This is likely
          : to be unique in an alignment/database.
          : Used a lot by SimpleAlign.
 Example  :
 Returns  : A string
 Arguments: Two optional arguments - first being the name/ separator,
            second the start-end separator
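A minimal sketch of how such a tag could be assembled is shown below. The function nse_tag() is a hypothetical stand-in that takes the name, start and end as plain arguments instead of reading them from an object; the default separators "/" and "-" follow the "name/start-end" form given above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for get_nse(): join name, start and end with two
# optional separators, defaulting to "/" and "-".
sub nse_tag {
    my ($name, $start, $end, $name_sep, $range_sep) = @_;
    $name_sep  = '/' unless defined $name_sep;
    $range_sep = '-' unless defined $range_sep;
    return $name . $name_sep . $start . $range_sep . $end;
}

print nse_tag('seq1', 1, 50), "\n";
print nse_tag('seq1', 1, 50, ':', '..'), "\n";
```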
 Title    : origin()
 Usage    : $myseq->origin($value)
 Function : Sets the origin field which is actually the second
          : field of the Type list. The {type} field is a 2 value list
          : with a format of ["Monomer","Origin"]
 Returns  : Original value
 Argument : string
 Comments : SAC: Consider renaming this method to "organism()" or "species()".
          : "origin" is ambiguous and can be easily confused with
          : coordinate data (0,0).

 Title    : type()
 Usage    : $myseq->type($value)
 Function : Sets the type field which is the first
          : field of the Type list. The {type} field is a 2 value list
          : with a format of ["Monomer","Origin"]
 Returns  : String containing one of the recognized sequence types:
          : 'unknown', 'dna', 'rna', 'amino', 'otherseq', 'aligned'
          : See the %Seq::SeqAlph hash for the current types.
 Argument : string containing a valid sequence type
          : SAC: case of user-supplied argument does not matter
Title : ffmt() Usage : $format = $myseq->ffmt; : $myseq->ffmt("Fasta"); : Function : The file format field is used by the internal : sequence parsing code when trying to read : in a sequence file. It is also what is used : as a default output format if the layout : method is called without an argument. : : If a sequence object is created without : reading in a file, or if the file is read : in with the use of the ReadSeq package then : the ffmt field can be set to indicate any default : output-format preference. : : If a sequence is read from a file and parsed : by internal code (ReadSeq not used) then the ffmt : field should describe the format of the sequence : file. The ffmt field is used to send the sequence : to the correct internal parsing code. : Returns : original ffmt value Argument : recognized ffmt string value (see list of recognized : formats) # SAC: What are they?! This list should be obvious. : Valid strings: : RAW, FASTA, GCG, IG, GENBANK, NBRF, EMBL, : MSF, PIR, GCG_SEQ, GCG_REF, STRIDER, ZUKER, : SAC: case of user-supplied argument does not matter
Title : descffmt() Usage : $desc = $myseq->descffmt; : $myseq->descffmt($new_value); Function : : Returns : original value Argument : $new_value (one of the formats as defined in $SeqForm). : SAC: case of $new_value argument does not matter.
Title : setseq() Usage : $self->setseq($new_sequence); Function : Changes the sequence inside a bioseq object : Returns : sequence string Argument : sequence string
Title : parse Usage : parse($ent,[$ffmt]); Function : Invokes the proper parsing code depending on : the value of the object 'ffmt' field. Example : $self->parse; Returns : n/a Argument : the prospective sequence to be parsed, : and optionally its format so that it doesn't need to : be estimated : SAC: case of $ffmt argument does not matter.
Title : parse_raw Usage : parse_raw; Function : parses $ent into the $self->{"seq"} field, using Raw : file format. Example : $self->parse_raw; Returns : n/a Argument : n/a
Title : parse_genbank
    sub parse_genbank {
        my ($self, $ent) = @_;
        my $seqstart = 0;   # true while inside the ORIGIN (sequence) block
        my $defstart = 0;   # true while inside a multi-line DEFINITION

        for ( split /\n/, $ent ) {
            m/LOCUS\s*(\S+)/ and $self->{"id"} = $1;
            if ( m/DEFINITION\s*(.+)/ ) {
                $self->{"desc"} = $1;
                $defstart = 1;
                next;
            }
            if ( $defstart ) {
                # continuation lines start with 12 spaces; keep one as separator
                if ( m/^ {11}( .+)/ ) { $self->{"desc"} .= $1; next; }
                $defstart = 0;
            }
            if ( m/ORIGIN/ ) { $seqstart = 1; next; }
            m!//! and $seqstart = 0;
            if ( $seqstart ) {
                s/[\s\d]//g;   # strip whitespace and position numbers
                $self->{"seq"} .= $_;
            }
        }
        return 1;
    }
Title : parse_fasta Usage : parse_fasta; Function : parses $ent into the "seq" field, using Fasta : file format. : To-do : use benchmark module to find best/fastest parse : method : Example : $self->parse_fasta; Returns : n/a Argument : n/a
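For orientation, Fasta parsing can be sketched in a few lines of plain Perl. This is not the module's parse_fasta code: parse_fasta_str() is a hypothetical function that returns a hash reference instead of filling object fields, and it assumes a single entry whose header line has the form ">id description".

```perl
use strict;
use warnings;

# Hypothetical sketch: split one Fasta entry into id, description and
# sequence. The first word after ">" is taken as the id, the rest of the
# header line as the description, and all remaining lines as sequence.
sub parse_fasta_str {
    my ($ent) = @_;
    my ($header, @lines) = split /\n/, $ent;
    die "not a Fasta entry\n"
        unless defined $header && $header =~ /^>\s*(\S*)\s*(.*)$/;
    my ($id, $desc) = ($1, $2);
    my $seq = join '', @lines;
    $seq =~ s/\s+//g;                 # drop any embedded whitespace
    return { id => $id, desc => $desc, seq => $seq };
}

my $entry = parse_fasta_str(">s1 sample sequence\nACGT\nTTAA\n");
print "$entry->{id}: $entry->{seq}\n";
```

For real work the SeqIO system is the recommended route, since it also handles files containing several entries.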
 Title    : parse_gcg
 Usage    : used by internal code
 Function : Parses the sequence out of a gcg-format string and
          : sets the object sequence field accordingly. This is
          : a simple, inefficient method for grabbing JUST the
          : sequence.
 To-do    : - parse out more info than just sequence
          : - implement alphabet checking
          : - better regular expressions/efficiency
          : - carp on unexpected / wrong-format situations
 Version  : .01 / 16 Jan 1997
 Returns  : 1
 Argument : gcg-formatted sequence string
Title : layout() Usage : layout([$format]); Function : Returns the sequence in whichever format the user specifies, or in the "ffmt" field if the user does not specify a format. Example : $fastaFormattedSeq = $myObj->layout("Fasta"); Returns : varies Argument : $format (one of the formats as defined in $SeqForm). : SAC: case of $ffmt argument does not matter.
Title : out_raw Usage : out_raw; Function : Returns the sequence in Raw format. Example : $self->out_raw; Returns : string sequence, in raw format Argument : n/a
Title : out_fasta Usage : out_fasta; Function : Returns the sequence as a string in FASTA format. Example : $self->out_fasta; : To-do : benchmark code / find fastest method : Returns : string sequence in Fasta format Argument : n/a
Title : alphabet_ok Usage : $myseq->alphabet_ok; Function : Checks the sequence for presence of any characters : that are not considered valid members of the genetic : alphabet. In addition to the standard genetic alphabet : (see documentation), "?" and "-" characters are : considered valid. : Example : if($myseq->alphabet_ok) { print "OK!!\n"; } : else { print "Not OK! \n"; } : Note : Does not handle '\' characters in sequence robustly : Returns : 1 if OK / 0 if not OK Argument : none
Title : alphabet Usage : $myseq->alphabet; Function : Returns the characters in the alphabet in use for the sequence. Example : print "Alphabet: ".$myseq->alphabet; Returns : string containing alphabet characters Argument : none
Title : GCG_checksum Usage : $myseq->GCG_checksum; Function : returns a gcg checksum for the sequence Example : Returns : Argument : none
 Title    : trunc
 Usage    : $trunc_seq = $mySeq->trunc(12,20);
 Function : Returns a truncated part of the sequence, truncation
            happening by the ->str() call. This is therefore just a
            convenience call for this object.
 Returns  : Bio::Seq object ref.
 Argument : start point, end point in biological coordinates
Title : copy Usage : $copyOfObj = $mySeq->copy; Function : Returns an identical copy of the object. Example : Returns : Bio::Seq object ref. Argument : n/a
Title : revcom Usage : $reverse_complemented_seq = $mySeq->revcom; Function : Returns a Bio::Seq object with the reverse : complement of a nucleotide object sequence Example : $reverse_complemented_seq = $mySeq->revcom; Source : Guts from Jong's <jong@mrc-lmb.cam.ac.uk> : library of molbio perl routines Note : : The letter codes and compliment translations : are those proposed by IUB (Nomenclature Committee, : 1985, Eur. J. Biochem. 150; 1-5) and are also : used by the GCG package. The IUB/GCG letter codes : for nucleotide ambiguity are compatible with : EMBL, GenBank and PIR database formats but are : *NOT* compatible with Stadem/Sanger ambiguity : symbols. Staden/Sanger use different symbols to : represent uncertainty and frame abiguity. : : Currently Staden/Sanger are not recognized : sequence types. : : GCG Documentation on sequence symbols: URL : http://www.neb.com/gcgdoc/GCGdoc/Appendices/appendix_iii.html : Translation : : GCG/IUB Meaning Complement : ------------------------------------ : A A T : C C G : G G C : T T A : U U A : M A or C K : R A or G Y : W A or T W : S C or G S : Y C or T R : K G or T M : V A or C or G B : H A or C or T D : D A or G or T H : B C or G or T V : X G or A or T or C X : N G or A or T or C N :-------------------------------------- Revision : 0.01 / 3 Jun 1997 Returns : A new sequence object to get the actual sequence go $actual_reversed_sequence = $seq->revcom()->str() Argument : n/a
Title    : complement
Usage    : $complemented_seq = $mySeq->complement;
Function : Returns a char string containing
         : the complementary sequence (e.g. the other strand)
         : of the original sequence. The translation method
         : is identical to revcom() but the nucleotide order
         : is not reversed.
         :
         : To be honest *most* of the time you will want
         : to use revcom, not this. Be careful!
Example  : $complemented_seq = $mySeq->complement;
Source   : Guts from Jong's <jong@mrc-lmb.cam.ac.uk>
         : library of molbio perl routines
Note     :
         : The letter codes and complement translations
         : are those proposed by IUB (Nomenclature Committee,
         : 1985, Eur. J. Biochem. 150; 1-5) and are also
         : used by the GCG package. The IUB/GCG letter codes
         : for nucleotide ambiguity are compatible with
         : EMBL, GenBank and PIR database formats but are
         : *NOT* compatible with Staden/Sanger ambiguity
         : symbols. Staden/Sanger use different symbols to
         : represent uncertainty and frame ambiguity.
         :
         : Currently Staden/Sanger are not recognized
         : sequence types.
         :
         : GCG Documentation on sequence symbols:
URL      : http://www.neb.com/gcgdoc/GCGdoc/Appendices/appendix_iii.html
Translation :
         : GCG/IUB   Meaning              Complement
         : ------------------------------------------
         : A         A                    T
         : C         C                    G
         : G         G                    C
         : T         T                    A
         : U         U                    A
         : M         A or C               K
         : R         A or G               Y
         : W         A or T               W
         : S         C or G               S
         : Y         C or T               R
         : K         G or T               M
         : V         A or C or G          B
         : H         A or C or T          D
         : D         A or G or T          H
         : B         C or G or T          V
         : X         G or A or T or C     X
         : N         G or A or T or C     N
         : ------------------------------------------
Revision : 0.01 / 6 Dec 1996
Returns  : char string
Argument : n/a
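The IUB/GCG complement table used by revcom and complement can be expressed as a single character transliteration. The following is a standalone sketch on plain strings, not the module's own code; complement_sketch and revcom_sketch are hypothetical names, and preserving lower-case input is an assumption rather than documented behaviour.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of Bio::Seq.
# Each character is mapped through the IUB/GCG table:
# A<->T, C<->G, U->A, M<->K, R<->Y, V<->B, H<->D; W, S, X and N map to themselves.
sub complement_sketch {
    my ($seq) = @_;
    (my $comp = $seq) =~
        tr/ACGTUMRWSYKVHDBXNacgtumrwsykvhdbxn/TGCAAKYWSRMBDHVXNtgcaakywsrmbdhvxn/;
    return $comp;
}

# Reverse complement: complement every symbol, then reverse the order.
sub revcom_sketch {
    my ($seq) = @_;
    return scalar reverse complement_sketch($seq);
}

print complement_sketch('AAGC'), "\n";
print revcom_sketch('AAGC'), "\n";
```

Note that U complements to A but A complements to T, so complementing an RNA string twice does not give back the original in this sketch.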
Title    : reverse
Usage    : $reversed_seq = $mySeq->reverse;
Function : Returns a char string containing the
         : reverse of the object sequence
         :
         : Does *NOT* complement it. If you want
         : the other strand, use $mySeq->revcom()
Example  : $reversed_seq = $mySeq->reverse;
Revision : 0.01 / 6 Dec 1996
Returns  : char string
Argument : n/a
Title    : Dna_to_Rna
Usage    : $translated_seq = $mySeq->Dna_to_Rna;
Function : Returns a char string containing the
         : Rna translation of the Dna nucleotide sequence
         : (Replaces T with U)
Example  : $translated_seq = $mySeq->Dna_to_Rna;
Source   : modified from Jong's <jong@mrc-lmb.cam.ac.uk>
         : library of molbio perl routines
Revision : 0.01 / 6 Dec 1996
Returns  : char string
Argument : n/a
Title    : Rna_to_Dna
Usage    : $translated_seq = $mySeq->Rna_to_Dna;
Function : Returns a char string containing the
         : Dna translation of the Rna nucleotide sequence
         : (Replaces U with T)
Example  : $translated_seq = $mySeq->Rna_to_Dna;
Revision : 0.01 / 16 MAR 1997
Returns  : char string
Argument : n/a
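Both conversions amount to a one-character substitution. A minimal standalone sketch on plain strings follows; the function names are hypothetical, and handling lower-case letters as well is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of Bio::Seq.
# DNA to RNA: replace T with U (lower case handled the same way).
sub dna_to_rna_sketch {
    my ($seq) = @_;
    (my $rna = $seq) =~ tr/Tt/Uu/;
    return $rna;
}

# RNA to DNA: replace U with T.
sub rna_to_dna_sketch {
    my ($seq) = @_;
    (my $dna = $seq) =~ tr/Uu/Tt/;
    return $dna;
}

print dna_to_rna_sketch('ATGT'), "\n";
```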
Title    : translate
Usage    : $translation = $mySeq->translate;
Function : Returns a new Bio::Seq object with the protein
         : translation from this sequence
         :
         : "*" is the default symbol for a stop codon
         : "X" is the default symbol for an unknown codon
Example  : $translation = $mySeq->translate;
         : -or- with user defined stop/unknown codon symbols:
         : $translation = $mySeq->translate($stop_symbol,$unknown_symbol);
Source   : modified from Jong's <jong@mrc-lmb.cam.ac.uk>
         : library of molbio perl routines
To-do    : - allow named parameters (just like new and out_GCG)
         : - allow "frame" parameter to pick translation frame
Revision : 0.01 / 6 Dec 1996
Returns  : new Sequence object. Its id is the original id.trans
Argument : optional stop symbol and unknown-codon symbol (see Example)
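The role of the stop and unknown symbols can be illustrated with a standalone sketch that translates a plain string using the standard genetic code. This is not the module's implementation: translate_sketch is a hypothetical name, translation always starts at the first base, a trailing partial codon is silently dropped, and the result is a string rather than a new sequence object. All of these are assumptions.

```perl
use strict;
use warnings;

# Standard genetic code, with codons enumerated in T, C, A, G order.
my @bases  = qw(T C A G);
my $aminos = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
my $n = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon_table{"$b1$b2$b3"} = substr($aminos, $n++, 1);
        }
    }
}

# Hypothetical helper, not part of Bio::Seq.
sub translate_sketch {
    my ($seq, $stop, $unknown) = @_;
    $stop    = '*' unless defined $stop;
    $unknown = 'X' unless defined $unknown;
    (my $dna = uc $seq) =~ tr/U/T/;    # accept RNA input as well
    my $protein = '';
    for (my $i = 0; $i + 3 <= length $dna; $i += 3) {
        my $aa = $codon_table{ substr($dna, $i, 3) };
        # Codons missing from the table (e.g. containing N) become $unknown.
        $protein .= !defined $aa ? $unknown : $aa eq '*' ? $stop : $aa;
    }
    return $protein;
}

print translate_sketch('ATGGCCTAA'), "\n";
print translate_sketch('ATGGCCTAA', '.', '?'), "\n";
```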
Title    : dump
Usage    : @results = $mySeq->dump;
         : -or-
         : $results = $mySeq->dump;
Function : Returns a formatted array or string (depending on how it
         : is invoked) containing the contents of a
         : Bio::Seq object. Useful for debugging
         :
         : ***This is used by Chris Dagdigian for debugging ***
         : ***Probably should be removed before distribution***
Example  : @results = $mySeq->dump;
         : foreach(@results){print;}
         : -or-
         : print $myseq->dump;
Returns  : Array or string depending on value of wantarray
Argument : n/a
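Several methods in this section return either a list or a string "depending on the value of wantarray". The sketch below shows that Perl idiom on a plain hash reference; dump_sketch and the "key => value" line format are hypothetical and only illustrate the calling convention, not the real dump output.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::Seq.
# Builds one "key => value" line per field, then returns the lines as a
# list in list context or as one joined string in scalar context.
sub dump_sketch {
    my ($obj) = @_;
    my @lines = map { "$_ => $obj->{$_}\n" } sort keys %$obj;
    return wantarray ? @lines : join('', @lines);
}

my %fields  = (id => 'sample', seq => 'ACGT');
my @as_list = dump_sketch(\%fields);    # list context: two lines
my $as_text = dump_sketch(\%fields);    # scalar context: one string
print $as_text;
```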
Title    : out_bad()
Usage    : out_bad;
Function : Throws a fatal error if we don't know the output format.
Example  : $self->out_bad;
Returns  : n/a
Argument : n/a
Title    : out_primer()
Usage    : $formatted_seq = $myseq->out_primer;
         : @formatted_seq = $myseq->out_primer;
         :
         : print $myseq->out_primer(-id=>'New ID',
         :                          -header=>'This is my header');
Function : outputs a sequence in primer format
Note     : Not a supported output type (cannot be invoked via layout)
         : Use at your own risk :)
Example  : see usage
Revision : 0.01 / 20 Dec 1996
Returns  : string or list, depending on how it is invoked
Argument : named list parameters for "id" and "header" are allowed
Title    : out_pir()
Usage    : $formatted_seq = $myseq->layout("PIR");
         : $formatted_seq = $myseq->out_pir;
         : @formatted_seq = $myseq->out_pir;
         :
         : print $myseq->out_pir(-title=>'New TITLE',
         :                       -entry=>'New ENTRY',
         :                       -acc=>'User defined accession',
         :                       -date=>'User defined date',
         :                       -reference=>'User defined ref info');
Function : Returns a string or an array depending on how it
         : is invoked. Can be easily accessed via the layout()
         : method, or if more output control is desired it can
         : be called directly with the following named parameters:
         :
         :   -entry      PIR entry
         :   -title      PIR title
         :   -acc        user defined accession number
         :   -reference  user defined reference
         :   -date       user defined date/time info
         :
         : All named parameters will take precedence over any
         : default behavior. When there are no user arguments,
         : the default output is as follows:
         :
         :   PIR 'ENTRY'     = sequence object "id" field
         :   PIR 'TITLE'     = sequence object "desc" field
         :   PIR 'DATE'      = current date/time
         :   PIR 'ACC'       = not used in default output
         :   PIR 'REFERENCE' = not used in default output
Note     : Not tested stringently.
WARNING  : Does not deal with numbering issue
To-do    : - Allow user to pass in hash of additional fields/values
         : - Deal with numbering issue
Example  : see usage
Revision : 0.02 / 12 Jan 1997
Returns  : string or list, depending on how it is invoked
Argument : named list parameters are allowed, see above
Title    : out_genbank()
Usage    : $formatted_seq = $myseq->out_genbank;
         : @formatted_seq = $myseq->out_genbank;
         : print $myseq->out_genbank(-id=>'New ID',
         :                           -def=>'User defined definition',
         :                           -acc=>'User defined accession',
         :                           -origin=>'User defined origin info',
         :                           -spacing=>'single',
         :                           -caps=>'up',
         :                           -date=>'DATE GOES HERE',
         :                           -type=>'mRna');
Function : Returns a GenBank formatted sequence array or string
         : depending on the value of wantarray when invoked via layout().
         : If more control is desired over output format, out_genbank()
         : can be addressed directly with the following named parameters:
         :
         :   def     - Sequence definition information
         :   acc     - Sequence accession number
         :   origin  - Sequence origin information
         :   id      - short name
         :   date    - new date info
         :   type    - sequence type (Dna, mRna, Amino, etc.)
         :   spacing - "single" or "double" sequence line spacing
         :   caps    - "up" or "down" sequence capitalization
         :
         : When invoked via layout() or called directly with no
         : arguments, the following default behaviours apply:
         :   DATE       = Current date and time
         :   DEFINITION = object's description field
         :   ID         = object's ID field
         :   SPACING    = single
         :
         : All named parameters must be strings. Passed in parameters will
         : always take precedence over any fields with default settings.
Note     : Format not stringently tested for accuracy. Sequence is numbered
         : according to the integer specified in the object 'start' field
         : but the implementation has not been robustly tested.
To-do    : - allow user hash reference for additional format fields
Example  : see usage
Revision : 0.02 / 12 Jan 1997
Returns  : string or list, depending on how it is invoked
Argument : named list parameters are allowed, see above
Title    : out_GCG
Usage    : $formatted_seq = $mySeq->layout("GCG");
         : @formatted_seq = $mySeq->layout("GCG");
         :
         : print $myseq->out_GCG(-id=>'New ID',
         :                       -spacing=>'single',
         :                       -caps=>'up',
         :                       -date=>'DATE GOES HERE',
         :                       -header=>'This is a user submitted header',
         :                       -type=>'n');
Function : Returns a GCG formatted sequence array or string
         : depending on the value of wantarray when invoked via layout().
         : If more control is desired over output format, out_GCG()
         : can be addressed directly with the following named parameters:
         :
         :   header  - first line(s) of formatted sequence
         :   id      - short name that appears before 'Length:' field
         :   date    - overwrite default date info
         :   type    - can be "N" or "P", for nucleotide/protein
         :   spacing - "single" or "double" sequence line spacing
         :   caps    - "up" or "down" sequence capitalization
         :
         : When invoked via layout() or called directly with no
         : arguments, the following default behaviours apply:
         :   DATE       = Current date and time
         :   DEFINITION = object's description field
         :   ID         = object's ID field
         :   SPACING    = single
         :
         : All named parameters must be strings. Passed in parameters will
         : always take precedence over any fields with default settings.
Example  : Output
         :
         : Sample Bio::Seq sequence
         : sample Length: 240 Wed Nov 27 13:24:28 EST 1996 Type: N Check: 5371 ..
         :
         :   1 aaaacctatg gggtgggctc tcaagctgag accctgtgtg cacagccctc
         :  51 tggctggtgg cagtggagac gggatnnnat gacaagcctg ggggacatga
         : 101 ccccagagaa ggaacgggaa caggatgagt gagaggaggt tctaaattat
         : 151 ccattagcac aggctgccag tggtccttgc ataaatgtat agagcacaca
         : 201 ggtgggggga aagggagaga gagaagaagc cagggtataa
         :
Note     : GCG formatted sequences contain a "Type:" field.
         : If Type cannot be internally determined and no
         : Type name-parameter is passed in then the Type:
         : field is not printed.
Warning  : Unconventional numbering offsets may not
         : be robustly handled
Revision : 0.06 / 12 Jan 1997
Source   : Found guts of this code on bionet.gcg, unknown author
Returns  : Array or String
Argument : n/a
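The numbered body of the sample output above (50 residues per line, in blocks of 10, each line prefixed with the position of its first residue) can be approximated with the standalone sketch below. The column widths are assumptions, since the exact spacing is not recoverable from this page; gcg_body_sketch is a hypothetical name and the header line with Length, Type and Check is not produced.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::Seq.
# Emits one line per 50 residues: a right-aligned start position
# (width 8 is an assumption) followed by space-separated blocks of 10.
sub gcg_body_sketch {
    my ($seq) = @_;
    my @lines;
    for (my $pos = 0; $pos < length $seq; $pos += 50) {
        my $chunk  = substr($seq, $pos, 50);
        my @blocks = $chunk =~ /(.{1,10})/g;
        push @lines, sprintf("%8d  %s\n", $pos + 1, join(' ', @blocks));
    }
    return wantarray ? @lines : join('', @lines);
}

print scalar gcg_body_sketch('acgt' x 15);
```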
Title    : out_nbrf()
Usage    : $self->layout("NBRF") or $self->out_nbrf
Function : FORMAT NOT INTERNALLY IMPLEMENTED YET!!!
         :
         : If the ReadSeq wrapper Parse.pm appears
         : to be configured properly it is used
         : to generate the output.
         :
         : If Parse.pm cannot be used then this code
         : carps out with an error message.
To-do    : write internal output code
Version  : 1.0 / 16 MAR 1997
Example  : see Usage
Returns  : FORMATTED STRING (wantarray is not used here!)
Argument :
Title    : out_gcgseq
Usage    : out_gcgseq;
Function : Returns the sequence as a string in GCG_SEQ format.
Example  : $self->out_gcgseq;
Returns  : string sequence in GCG_SEQ format
Argument : n/a
Comments : SAC: Derived from out_fasta().
         : GCG_SEQ is a format that looks a lot like Fasta and is used
         : for building GCG sequence datasets (.seq files).
         : It also has some similarities to NBRF format.
Title    : out_gcgref
Usage    : out_gcgref;
Function : Returns the sequence as a string in GCG_REF format.
Example  : $self->out_gcgref;
Returns  : string sequence in GCG_REF format
Argument : n/a
Comments : SAC: Derived from out_gcgseq().
         : GCG_REF is a companion format for GCG_SEQ that is used
         : for building GCG sequence datasets (.ref files).
         : The .ref file is identical to the .seq file but without the sequence.
Title    : out_ig()
Usage    : $self->layout("IG") or $self->out_ig
Function : FORMAT NOT INTERNALLY IMPLEMENTED YET!!!
         :
         : If the ReadSeq wrapper Parse.pm appears
         : to be configured properly it is used
         : to generate the output.
         :
         : If Parse.pm cannot be used then this code
         : carps out with an error message.
To-do    : write internal output code
Version  : 1.0 / 16 MAR 1997
Example  : see Usage
Returns  : FORMATTED STRING (wantarray is not used here!)
Argument :
Title    : out_strider()
Usage    : $self->layout("Strider") or $self->out_strider
Function : FORMAT NOT INTERNALLY IMPLEMENTED YET!!!
         :
         : If the ReadSeq wrapper Parse.pm appears
         : to be configured properly it is used
         : to generate the output.
         :
         : If Parse.pm cannot be used then this code
         : carps out with an error message.
To-do    : write internal output code
Version  : 1.0 / 16 MAR 1997
Example  : see Usage
Returns  : FORMATTED STRING (wantarray is not used here!)
Argument :
Title    : out_zuker()
Usage    : $self->layout("Zuker") or $self->out_zuker
Function : FORMAT NOT INTERNALLY IMPLEMENTED YET!!!
         :
         : If the ReadSeq wrapper Parse.pm appears
         : to be configured properly it is used
         : to generate the output.
         :
         : If Parse.pm cannot be used then this code
         : carps out with an error message.
To-do    : write internal output code
Version  : 1.0 / 16 MAR 1997
Example  : see Usage
Returns  : FORMATTED STRING (wantarray is not used here!)
Argument :
Title    : out_msf()
Usage    : $self->layout("MSF") or $self->out_msf
Function : FORMAT NOT INTERNALLY IMPLEMENTED YET!!!
         :
         : If the ReadSeq wrapper Parse.pm appears
         : to be configured properly it is used
         : to generate the output.
         :
         : If Parse.pm cannot be used then this code
         : carps out with an error message.
To-do    : write internal output code
Version  : 1.0 / 16 MAR 1997
Example  : see Usage
Returns  : FORMATTED STRING (wantarray is not used here!)
Argument :
Title    : parse_unknown
Usage    : parse_unknown($ent);
Function : Tries to figure out the format of $ent and then
         : calls the appropriate function to parse it into $self->{"seq"}.
Example  : $self->parse_unknown;
Returns  : n/a
Argument : $ent : the rough multi-line string to be parsed
Title    : parse_bad
Usage    : parse_bad;
Function : Complains of an un-parsable sequence. Makes a last-ditch
         : attempt via Parse.pm if the sequence is being read from a file.
Example  : $self->parse_bad;
Returns  : n/a
Argument : n/a
Title    : version()
Usage    : $myseq->version;
Function : Prints the current Bio::Seq version number
The sequence object is merely a reference to a hash containing all or some of the following fields...

Field      Value
--------------------------------------------------------------
seq        the sequence
id         a short identifier for the sequence
desc       a description of the sequence, in descffmt file-format
names      a hash of identifiers that relate to the sequence.
           These could be database IDs, accession numbers, URLs,
           pathnames, etc. Currently there is no set format for the
           names hash and no formal definition of databases or names
start      start in bio-coords of the first residue of the sequence
end        end in bio-coords of the last residue of the sequence
type       the sequence type. This is actually a 2-value list of the
           form ["monomer","origin"], where monomer is one of the
           recognized sequence types and origin is a string description
           of the sequence's origin (mitochondrial, etc.)
ffmt       file-format for the sequence
descffmt   file-format of the description string
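As an illustration of the layout described above, the following builds a plain hash reference with the same field names. It is only a sketch: the hash is not blessed into Bio::Seq, and the concrete values (the accession key, the 'Dna' type string and the descffmt value) are hypothetical placeholders rather than values confirmed by the module.

```perl
use strict;
use warnings;

# A plain hash reference shaped like the documented fields.
# All values are illustrative placeholders.
my $seq_fields = {
    seq      => 'ACTGTGGCGTCAACTG',
    id       => 'sample',
    desc     => 'Sample Bio::Seq sequence',
    names    => { accession => 'hypothetical-accession' },
    start    => 1,
    end      => 16,                          # start + length - 1
    type     => [ 'Dna', 'mitochondrial' ],  # ["monomer", "origin"]
    ffmt     => 'Fasta',
    descffmt => 'Fasta',                     # assumed value
};

my ($monomer, $origin) = @{ $seq_fields->{type} };
print "$seq_fields->{id}: $monomer ($origin), ",
      "residues $seq_fields->{start}..$seq_fields->{end}\n";
```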