String-Simrank
===========================

The String::Simrank module allows rapid searches for similarity between
query strings and a database of strings. This module is maintained by
molecular biologists who use it to search for similarities among strings
representing contiguous DNA or RNA sequences. The program does not
construct an alignment; instead, it finds the ratio of n-mers (a.k.a.
n-grams) that are shared between a query and each database record. The
input file should be a FASTA-formatted multiple-sequence file (either
aligned or unaligned).

Memory consumption is moderate: it grows linearly with the number of
sequences and depends on the n-mer size defined by the user. Using
7-mers, ~20,000 strings of ~1,500 characters each require ~50 MB.

The module can be used from the command line through the provided script
examples/simrank_nuc.pl. By default the output is written to STDOUT and
reports the similarity of each query string to the top hits in the
database. The format is: query_id, tab, best-match database_id, colon,
percent similarity, space, second-best-match database_id, colon, percent
similarity, and so on.

Simrank statistic:

For those more comfortable with set-theoretic descriptions of algorithms,
we provide one here. Note, however, that it describes only what is
calculated, not how: the description given here is algorithmically less
efficient than the implementation in this package.

In words, the simrank statistic can be defined as follows.

Definition: The simrank score is the number of unique n-mers shared
between a query sequence and a database sequence, divided by the smaller
of the two sequences' unique n-mer counts.

Let D be a database sequence and Q be a query sequence. For a given n
(the length of the n-mer), we compute the number of unique n-mers that
occur in each of the sequences. Let these numbers be nmer(D) and nmer(Q),
respectively.
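
The definition above can be sketched directly with sets. The following
Python snippet is an illustration only, not the module's own
implementation or API: the function names unique_nmers and simrank, the
toy sequences, and the convention of returning 0 when a sequence is
shorter than n are all assumptions made for this example.

```python
def unique_nmers(seq, n):
    """Return the set of distinct n-mers (substrings of length n) in seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def simrank(db_seq, query_seq, n=7):
    """Shared unique n-mers divided by the smaller unique n-mer count."""
    d = unique_nmers(db_seq, n)
    q = unique_nmers(query_seq, n)
    smaller = min(len(d), len(q))
    if smaller == 0:
        # Assumed convention: a sequence shorter than n has no n-mers.
        return 0.0
    return len(d & q) / smaller


if __name__ == "__main__":
    # Toy sequences and n=3 are chosen only to keep the arithmetic visible.
    print(f"{100 * simrank('ACGTACGT', 'ACGTTT', n=3):.2f}")  # prints 50.00
```

With n = 3, each toy sequence has 4 unique 3-mers and they share 2 of
them (ACG and CGT), so the score is 2/4. Multiplying by 100 gives a value
on the same percent scale as the command-line output, assuming that
output is simply the statistic expressed as a percentage.
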

Further, let nvec(D) and nvec(Q) be the indicator vectors of length |A|^n,
with one entry for each possible n-mer, where |A| is the size of the
alphabet used and nvec(D)[i] = 1 if D contains n-mer i (in a predefined
total enumeration), and 0 otherwise. Then the Simrank statistic between
two sequences can be computed as

                           1
   Simrank(D,Q) = ---------------------- <nvec(D), nvec(Q)>,
                   min(nmer(D),nmer(Q))

where <.,.> denotes the inner product.

Of course, String::Simrank computes this statistic more efficiently than
by explicitly enumerating all possible n-mers. Instead, it considers only
the n-mers that are actually observed in either D or Q.

INSTALLATION

To install this module, use the standard procedure by typing the
following:

   perl Makefile.PL
   make
   make test
   make install

DEPENDENCIES

Simrank depends on a few packages that are readily available through
CPAN. These are:

   Inline
   Inline::C
   File::Basename
   IO::File
   Fcntl
   Data::Dumper
   Storable

EXAMPLE

The standalone executable script simrank_nuc.pl is located in the
examples folder of the distribution package. You can use the test data
provided to learn how to use Simrank. In the following example we build
a database from the test_data/db.fasta file and search it for
similarity with the sequences in test_data/query.fasta. To do so, type:

   perl simrank_nuc.pl --data ../test_data/db.fasta --query ../test_data/query.fasta

You should see the following output:

   EscCol36 EscCol36:100.00 EscCol36-2:100.00 EscCol43:99.59 EscCol29:99.24 EscCol33:99.17 EscCol10:99.02 EscCol22:97.52 EscCo110:97.03 ShgDysen:96.13 EscCol37:93.48 RmlBacte:80.56 AluAgaro:35.45

Our query file contained only one sequence, EscCol36. In the output, the
id of this sequence comes first, followed by the list of all sequences
in the database sorted by their simrank statistic with the query
sequence.

We can restrict the output to hits at or above a minimum percent
similarity by using the --minpct flag.
For instance:

   perl simrank_nuc.pl --data ../test_data/db.fasta --query ../test_data/query.fasta --minpct 100

will output only the database sequences with 100% similarity to the
query.

CITATION

If you use Simrank in your research, please cite:

   Todd Z. DeSantis, Keith Keller, Gary L. Andersen,
   Alexander V. Alekseyenko, Niels Larsen.
   Simrank: Rapid and sensitive general purpose k-mer search tool.
   (in preparation)

COPYRIGHT AND LICENCE

simrank - high performance DNA/RNA similarity match utility

Copyright (C) 2007, 2008 Niels Larsen and Todd DeSantis.

This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.