Bio::MUST::Core::Ali - Multiple sequence alignment
version 0.182420
#!/usr/bin/env perl use Modern::Perl '2011'; # same as: # use strict; # use warnings; # use feature qw(say); use Bio::MUST::Core; use aliased 'Bio::MUST::Core::Ali'; # read Ali form disk my $ali = Ali->load('example.ali'); # get some properties say 'height: ' . $ali->height; # number of seqs say 'width: ' . $ali->width; # number of sites say '% miss: ' . $ali->perc_miss; # fraction of missing chars (%) say 'seqs are ' . $ali->is_protein ? 'prot' : 'nucl'; # turn seqs to uppercase $ali->uc_seqs; # filter out seqs with no species associated my @seqs = $ali->filter_seqs( sub { not $_->is_genus_only } ); use aliased 'Bio::MUST::Core::IdList'; my $list = IdList->new( ids => \@seqs ); $ali->apply_list($list); # alternatively: # $ali = Ali->new( seqs => \@seqs ); # filter out gap-rich sites $ali->idealize(0.2); # min 20% non-gaps per site # filter out non-informative sites use aliased 'Bio::MUST::Core::SeqMask'; my $mask = SeqMask->parsimony_mask($ali); $ali->apply_mask($mask); # write down reduced Ali to disk $ali->store('example-uc-genus-sp-20.ali'); $ali->store_fasta('example-uc-genus-sp-20.fasta'); # see below for additional methods # ...
This module implements the multiple sequence alignment (MSA) class and its methods. An Ali is modeled as an array of Bio::MUST::Core::Seq objects. Consequently, sequence ids do not absolutely need to be unique for it to work (though id uniqueness helps a lot for sequence access and filtering).
An Ali knows whether it contains nucleotide or protein sequences, whether those are aligned or not, as well as its Seq count and exact width. All these properties are introspected on the fly, which means that, while they can be expensive to compute, they are always accurate.
This is important because an Ali provides methods for inserting, deleting and editing its Seq objects. Further, it does its best to maintain a semantic distinction between true gap states (encoded as '*') and missing regions of undetermined length (encoded as pure whitespace, for example at the end of a short sequence).
It also has methods for mapping its sequence ids (for example, before export to the PHYLIP format), as well as methods to selectively retain sequences and sites based on Bio::MUST::Core::IdList and Bio::MUST::Core::SeqMask objects. For example, the idealize method discards shared gaps and optionally removes gap-rich sites only due to a tiny number of sequences.
idealize
Finally, an Ali can be stored in MUST pseudo-FASTA format (which handles meaningful whitespace and allows comment lines in the header) or be imported/exported from/to several popular MSA formats, such as plain FASTA, STOCKHOLM and PHYLIP.
ArrayRef of Bio::MUST::Core::Seq objects (optional)
Most of the accessor methods described below are implemented by delegation to this public attribute using "Moose native delegation" in Moose::Meta::Attribute::Native. Their documentation thus heavily borrows from the corresponding help pages.
Path::Class::File object (optional)
This optional attribute is initialized by class methods that load an Ali from disk. It is meant to improve the introspection capabilities of the Ali. For now, this attribute is not used by the store methods, though it might provide them with a default value in the future.
load
store
Boolean (optional)
By default, an Ali object tries to guess whether it is aligned or not by looking for gap-like characters in any of its Seq objects (see Bio::MUST::Core::Seq for the exact test performed on each sequence).
When this smart behavior causes issues, one can disable it by unsetting this boolean attribute (see dont_guess and guess accessor methods).
dont_guess
guess
ArrayRef of strings (optional)
An Ali object is commentable, which means that it supports all the methods pertaining to comment lines described in Bio::MUST::Core::Roles::Commentable (such as header).
header
Default constructor (class method) returning a new Ali.
use aliased 'Bio::MUST::Core::Ali'; my $ali1 = Ali->new(); my @seqs = $ali->all_seqs; my $ali2 = Ali->new( seqs => \@seqs );
This method accepts four optional arguments (see ATTRIBUTES above): seqs, file, guessing and comments.
seqs
file
guessing
comments
Creates a deep copy (a clone) of the Ali. Returns the copy.
use aliased 'Bio::MUST::Core::Ali'; my $ali = Ali->load('input.ali'); my $ali_copy = $ali->clone; # you can now mess with $ali_copy without affecting $ali
This method does not accept any arguments.
Adds one (or more) new sequence(s) at the end of the Ali. Returns the new number of sequences of the Ali.
use aliased 'Bio::MUST::Core::Seq'; my $new_seq = Seq->new( seq_id => 'seq1', seq => 'ACGT' ); $ali->add_seq($new_seq);
This method accepts any number of arguments.
Returns a sequence of the Ali by its index. You can also use negative index numbers, just as with Perl's core array handling. If the specified sequence does not exist, this method will return undef.
undef
my $seq = $ali->get_seq($index); croak "Seq $index not found in Ali!" unless defined $seq;
This method accepts just one argument (and not an array slice).
Returns a sequence of the Ali by its id. If the specified id is not unique, only the first matching sequence will be returned, whereas if no sequence exists for the specified id, this method will return undef.
my $id = 'Pyrus malus_3750@658052655'; my $seq = $ali->get_seq_with_id($id); croak "Seq $id not found in Ali!" unless defined $seq;
This method accepts just one argument.
Given an index and a sequence, sets the specified Ali element to the sequence. This method returns the new sequence at $index.
$index
use aliased 'Bio::MUST::Core::Seq'; my $new_seq = Seq->new( seq_id => 'seq1', seq => 'ACGT' ); $ali->set_seq($index, $new_seq);
This method requires two arguments.
Removes the sequence at the given index from the Ali. This method returns the deleted sequence. If the specified sequence does not exist, it will return undef instead.
$ali->delete_seq($index);
This method requires one argument.
Inserts a new sequence into the Ali at the given index. This method returns the new sequence at $index.
use aliased 'Bio::MUST::Core::Seq'; my $new_seq = Seq->new( seq_id => 'seq1', seq => 'ACGT' ); $ali->insert_seq($index, $new_seq);
Returns all the sequences of the Ali (not an array reference).
my @seqs = $ali->all_seqs;
Returns the first sequence of the Ali matching a given criterion, just like List::Util's first function. This method requires a subroutine implementing the matching logic.
first
# emulate get_seq_with_id method my $id2find = 'seq13'; my $seq = $ali->first_seq( sub { $_->full_id eq $id2find } );
This method requires a single argument.
Returns every sequence of the Ali matching a given criterion, just like Perl's core grep function. This method requires a subroutine implementing the matching logic.
grep
# keep only long sequences (ignoring gaps and missing states) my @long_seqs = $ali->filter_seqs( sub { $_->nomiss_seq_len > 500 } );
Returns all the sequence ids (Bio::MUST::Core::SeqId objects) of the Ali (not an array reference). This is only a convenience method.
use Test::Deeply; my @ids1 = $ali->all_seq_ids; my @ids2 = map { $_->seq_id } $ali->all_seqs; is_deeply \@ids1, \@ids2, 'should be true'; my @orgs = map { $_->org } @ids1;
Returns the stringified filename of the Ali.
Turn on the smart detection of gaps (see guessing attribute above).
Turn off the smart detection of gaps (see guessing attribute above).
use aliased 'Bio::MUST::Core::Ali'; my $ali = Ali->load('ensembl.fasta'); $ali->dont_guess;
Returns true if all the sequence ids are unique.
carp 'Warning: duplicate sequence ids!' unless $ali->has_uniq_ids;
Returns true if any sequence of the Ali looks like a protein. See Bio::MUST::Core::Seq for the exact test performed on each sequence.
say 'Your file includes nucleotide sequences' unless $ali->is_protein;
Returns true if any sequence of the Ali appears to be aligned. See Bio::MUST::Core::Seq for the exact test performed on each sequence.
If the boolean attribute guessing is not set, always returns false.
carp 'Warning: file does not look aligned!' unless $ali->is_aligned;
Returns the number of sequences of the Ali. The alias method height is provided for convenience.
height
my $height = $ali->count_seqs;
Returns the width of the Ali (in characters). If the Ali is not aligned, returns the length of the longest sequence instead.
To avoid potential bugs due to caching, this method dynamically computes the Ali width at each call. Moreover, the Ali is always uniformized (see below) beforehands to ensure accurate width value. Therefore, this method is expensive and should not be called repeatedly (e.g., in a loop condition).
# you'd better looping through sites like this... my $width = $ali->width; for my $site (0..$width-1) { ... }
Returns a list of 5 values summarizing the Ali seq lengths (ignoring gaps). The values are the following: Q0 (min), Q1, Q2 (median), Q3, and Q4 (max).
Returns the percentage of missing (and gap-like) character states in the Ali.
As this method internally calls Ali::width, the remarks above also apply.
Ali::width
my $miss_level = $ali->perc_miss;
Turn all the sequences of the Ali to uppercase and returns it.
Recode all the sequences of the Ali and returns it.
use aliased 'Bio::MUST::Core::Ali'; my $ali = Ali->load('biased.ali'); # set up RY recoding for suppressing codon bias my %base_for = ( A => 'A', G => 'A', # purines C => 'C', T => 'C', # pyrimidines ); my $ali_rec = $ali->recode_seqs( \%base_for ); $ali_rec->store('biased_ry.ali');
Remove the gaps in all the sequences of the Ali and returns it.
Spacifies all the sequences of the Ali and returns it. See the corresponding method in Bio::MUST::Core::Seq for the exact effect of this gap-cleaning operation.
Gapifies all the sequences of the Ali and returns it. See the corresponding method in Bio::MUST::Core::Seq for the exact effect of this gap-cleaning operation.
This method accepts an optional argument.
Trims all the sequences of the Ali and returns it. See the corresponding method in Bio::MUST::Core::Seq for the exact effect of this gap-cleaning operation.
Pads all the sequences of the Ali and returns it. See the corresponding method in Bio::MUST::Core::Seq for the exact effect of this gap-cleaning operation.
Performs the three gap-cleaning operations in turn on all the sequences of the Ali and returns it, which ensures that it is semantically clean and rectangular.
This is only a convenience method called internally by the Ali object before selected methods (such as storage-like methods). However, it might prove useful in some circumstances, hence it is not defined as private.
use aliased 'Bio::MUST::Core::Ali'; my $ali = Ali->load('input.ali'); $ali->add_seq(...); # more editing of the Ali sequences $ali->uniformize;
Replaces all the sequence ids of the Ali by their abbreviated forms as specified by the passed Bio::MUST::Core::IdMapper and returns the Ali.
Note that this method will work only if the sequence ids have not been already shortened or modified in any way since the creation of the IdMapper. Long ids without abbreviated forms in the IdMapper are left untouched.
use aliased 'Bio::MUST::Core::Ali'; use aliased 'Bio::MUST::Core::IdMapper'; my $ali = Ali->load('input.ali'); my $mapper = IdMapper->std_mapper($ali, 'lcl|seq'); $ali->shorten_ids($mapper); $ali->store_fasta('input.4blast.fasta'); # makeblastdb # Note: the temp_fasta method does exactly that
Replaces all the sequence ids of the Ali by their long forms as specified by the passed Bio::MUST::Core::IdMapper and returns the Ali.
Note that this method will work only if the sequence ids have been previously abbreviated (see above) and have not been modified in any way since then. Again, abbreviated ids without long forms in the IdMapper are left untouched.
use aliased 'Bio::MUST::Core::IdMapper'; my $mapper = IdMapper->gi_mapper($ali); $ali->shorten_ids($mapper); ... $ali->restore_ids($mapper);
Selectively retains or discards sequences from the Ali based on the content of the passed Bio::MUST::Core::IdList object and returns the Ali.
use aliased 'Bio::MUST::Core::IdList'; my $list = IdList->load('interesting_seqs.idl'); $ali->apply_list($list); # discard non-interesting seqs
Selectively retains or discards sites from the Ali based on the content of the passed Bio::MUST::Core::SeqMask object and returns the Ali.
use aliased 'Bio::MUST::Core::SeqMask'; my $variable_sites = SeqMask->variable_mask($ali); $ali->apply_mask($variable_sites); # discard constant sites
Computes and applies an ideal sequence mask to the Ali and returns it. This is only a convenience method.
When invoked without arguments, it will discard the gaps that are universally shared by all the sequences. Otherwise, the provided argument corresponds to the threshold of the ideal_mask method described in Bio::MUST::Core::SeqMask.
ideal_mask
use aliased 'Bio::MUST::Core::IdList'; my $fast_seqs = IdList->load('fast_evolving_seqs.idl'); my $seqs2keep = $fast_seqs->negative_list($ali); $ali->apply_list($seqs2keep); # discard fast-evolving seqs $ali->idealize; # discard newly shared gaps caused by fast seqs use aliased 'Bio::MUST::Core::Ali'; my $ali = Ali->load('hmm_based.ali'); $ali->idealize(0.05); # discard insertions due to <5% of the seqs
Returns a regular expression matching gaps and ambiguous or missing states. The exact regex returned depends on the type of sequences in the Ali (nucl. or proteins).
my $regex = $ali->gapmiss_regex; my $first_seq = $ali->get_seq(0)->seq; my $gapmiss_n = $first_seq =~ m/($regex)/xmsg; say "The first sequence has $gapmiss_n gaps or ambiguous/missing sites";
Converts a set of site positions from Ali coordinates to coordinates of the specified sequence (thereby ignoring positions due to gaps). Returns the converted sites in sequence coordinates as an array refrence.
use aliased 'Bio::MUST::Core::Ali'; my $ali = Ali->load('input.ali'); my $id = 'GIV-Norovirus Hum.GIV.1.POL_1338688@508124125'; my $ali_coords = [ 4, 25, 73, 89, 104, 116 ]; my $seq_coords = $ali->map_coords($id, $ali_coords); # $seq_coords is [ 3, 23, 59, 71, 71, 74 ]
This method requires two arguments: the id of a sequence and an array reference of input sites in Ali coordinates.
Class method (constructor) returning a new Ali read from disk. This method will transparently import plain FASTA files in addition to the MUST pseudo-FASTA format (ALI files).
use Test::Deeply; use aliased 'Bio::MUST::Core::Ali'; my $ali1 = Ali->load('example.ali'); my $ali2 = Ali->load('example.fasta'); my @seqs1 = $ali1->all_seqs; my @seqs2 = $ali2->all_seqs; is_deeply, \@seqs1, \@seqs2, 'should be true';
Writes the Ali to disk in the MUST pseudo-FASTA format (ALI files).
Note that the ALI format is only used when the suffix of the outfile name is '.ali'. In all other cases (including lack of suffix), this method automatically forwards the call to store_fasta.
store_fasta
$ali->store('output.ali'); # output.ali is written in ALI format $ali->store('output.fasta'); # output.fasta is written in FASTA format
This method requires one argument (but see store_fasta in case of automatic forwarding of the method call).
Writes the Ali to disk in the plain FASTA format.
For compatibility purposes, this method automatically fetches sequence ids using the foreign_id method instead of the native full_id method, both described in Bio::MUST::Core::SeqId.
foreign_id
full_id
$ali->store_fasta( 'output.fasta' ); $ali->store_fasta( 'output.fasta', {chunk => -1, degap => 1} );
This method requires one argument and accepts a second optional argument controlling the output format. It is a hash reference that may contain one or more of the following keys:
- clean: replace all ambiguous and missing states by C<X> (default: false) - degap: boolean value controlling degapping (default: false) - chunk: line width (default is 60 chars; negative values means no wrap)
Finally, it is possible to fine-tune the behavior of the clean option by providing another character than X through the gapify key. This can be useful to replace all ambiguous and missing states by gaps, as shown below:
clean
X
gapify
$ali->store_fasta( 'output.fasta, { clean => 1, gapify => '*' } );
Writes a temporary copy of the Ali to disk in the plain FASTA format using numeric sequence ids and returns the name of the temporary file. This is only a convenience method.
In list context, returns the IdMapper object along with temporary filename.
my $infile = $ali->temp_fasta( { degap => 1 } ); my $output = `script.sh $infile`; ...
This method accepts the same optional argument hash as store_fasta. However, an additional option (id_prefix) is available to control the way abbreviated sequence ids are prefixed by the std_mapper method (see Bio::MUST::Core::Listable).
id_prefix
std_mapper
my $infile1 = $ali1->temp_fasta( { id_prefix => 'file1-' } ); my $infile2 = $ali2->temp_fasta( { id_prefix => 'file2-' } );
Class method (constructor) returning a new Ali read from a file in the PHYLIP format. Both sequential and interleaved formats are supported.
The only constraint is that sequence ids MUST NOT contain whitespace and be followed by at least one whitespace character. This means that some old-school PHYLIP files not following this convention will not be processed correctly.
When using the interleaved format, sequence ids may or may not be repeated in each block.
use aliased 'Bio::MUST::Core::Ali'; my $ali = Ali->load_phylip('phylip.phy'); say $ali->count_seqs; say $ali->width; # outputs: # 10 # 709
Writes the Ali to disk in the interleaved (or sequential) PHYLIP format.
To ensure maximal flexibility, this method fetches sequence ids using the native full_id method described in Bio::MUST::Core::SeqId, but truncates them to 10 characters, as expected by the original PHYLIP software package. No other tinkering is carried out on the ids. Thus, if the ids contain whitespace or are not unique in their 10 first characters, it is advised to first map them using one of the constructors in Bio::MUST::Core::IdMapper.
use aliased 'Bio::MUST::Core::Ali'; use aliased 'Bio::MUST::Core::IdMapper'; my $ali = Ali->load('input.ali'); my $mapper = IdMapper->std_mapper($ali); $ali->shorten_ids($mapper); $ali->store_phylip( 'input.phy', { chunk => 50 } );
- short: truncate ids to 10 chars, as in original PHYLIP (defaut: yes) - clean: replace all ambiguous and missing states by 'X' (default: false) - chunk: line width (default: 60 chars; negative values means no wrap)
To store the Ali in PHYLIP sequential format, specify a negative chunk (-1).
Class method (constructor) returning a new Ali read from a file in the STOCKHOLM format. =GF comments are retained (see above) but not the other comment classes (=GS, =GR and =GC).
use aliased 'Bio::MUST::Core::Ali'; my $ali = Ali->load('upsk.stockholm'); say $ali->header; # outputs: # ID UPSK # SE Predicted; Infernal # SS Published; PMID 9223489 # RN [1] # RM 9223489 # RT The role of the pseudoknot at the 3' end of turnip yellow mosaic # RT virus RNA in minus-strand synthesis by the viral RNA-dependent RNA # RT polymerase. # RA Deiman BA, Kortlever RM, Pleij CW; # RL J Virol 1997;71:5990-5996.
Class method intended to transform a large sequence file read from disk without loading it in memory. This method will transparently process plain FASTA files in addition to the MUST pseudo-FASTA format (ALI files).
my $chunk = 200; my $split = sub { my $seq = shift; my $base_id = ( split /\s+/xms, $seq->full_id )[0]; my $max_pos = $seq->seq_len - $chunk; my $n = 0; my $out_str; for (my $pos = 0; $pos <= $max_pos; $pos += $chunk, $n++) { $out_str .= ">$base_id.$n\n" . $seq->edit_seq($pos, $pos + $chunk <= $max_pos ? $chunk : 2 * $chunk ) . "\n"; } return $out_str; }; use aliased 'Bio::MUST::Core::Ali'; Ali->instant_store( 'outfile.fasta', { infile => 'infile.fasta', coderef => $split } );
This method requires two arguments. The sercond is a hash reference that must contain the following keys: - infile: input sequence file - coderef: subroutine implementing the transforming logic
Alias for count_seqs method. For API consistency.
count_seqs
Denis BAURAIN <denis.baurain@uliege.be>
Catherine COLSON <ccolson@doct.uliege.be>
Arnaud DI FRANCO <arnaud.difranco@gmail.com>
This software is copyright (c) 2013 by University of Liege / Unit of Eukaryotic Phylogenomics / Denis BAURAIN.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
To install Bio::MUST::Core, copy and paste the appropriate command in to your terminal.
cpanm
cpanm Bio::MUST::Core
CPAN shell
perl -MCPAN -e shell install Bio::MUST::Core
For more information on module installation, please visit the detailed CPAN module installation guide.