The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

String::Simrank - Rapid comparisons of many strings for k-mer similarity.

SYNOPSIS

  use String::Simrank;
  my $sr = new String::Simrank(
              { data => 'db.fasta'});
  $sr->formatdb();
  $sr->match_oligos({ query => 'query.fasta' });

  You may utilize  k-mers of various sizes, e.g.:
  $sr->formatdb( { wordlen => 6 } );

DESCRIPTION

The String::Simrank module allows rapid searches for similarity between query strings and a database of strings. This module is maintained by molecular biologists who use it for searching for similarities among strings representing contiguous DNA or RNA sequences. This program does not construct an alignment but rather finds the ratio of nmers (A.K.A. ngrams) that are shared between a query and database records. The input file should be fasta formatted (either aligned or unaligned) multiple sequence file. The memory consumption is moderate and grows linearly with the number of sequences and depends on the nmer size defined by the user. Using 7-mers, ~20,000 strings each ~1,500 characters in length requires ~50 Mb.

The module can be used from the command line through the script simrnak_nuc.pl provided in the examples folder.

By default the output is written to STDOUT and represents the similarity of each query string to the top hits in the the database. The format is query_id, tab, best match database_id, colon, percent similarity, space second best match database_id, colon, percent similarity. query1 EscCol36:100.00 EscCol36:100.00 EscCol43:99.59 EscCol29:99.24 EscCol33:99.17 EscCol10:99.02

METHODS

new( { data_path => $path })

  my $sr = new String::Simrank( { data => 'many_seqs.fasta' });

Instantiates a new String::Simrank object. Each simrank database should live within its own String::Simrank object. In other words, each unique combination of a fasta database and n-mer length will have its own instance. If the database has already been formated then the wordlength will be read from the formatted database's .bin file. Parameters:

data

Sets the path of where the data fasta file is (Database sequences, fasta format). The same terminal directory is where the formated database will reside with the same base name but the .fasta will be substituted with .bin. Sequence names(identifiers) will be first 10 characters matched between the '>' and the first space character. Be sure the sequences names are unique within this string.

_check_arguments_for_new

Internally used function. Validates the 'data' argument needed so this instance knows what data file will be linked. Method checks that fasta file exists and is readable and creates the name which corresponds to the .bin file. If errors are found, program exits. Returns a hash or hash ref depending on the context of the method call.

formatdb({ wordlen => $int, minlen => $int, silent = $bool })

  $sr->formatdb({wordlen => 8, minlen => 500, silent => 0, valid_chars => 'ACGT'});

From an input collection of strings (data), generates a binary file optimized for retrieval of k-mer similarities. The file will contain a pre-computed map of which sequences - given by their index number in the input data file - contain a given k-mer. All the valid k-mers of the given length are mapped. This means each k-mer from a query string can be used to look up the database strings it occurs in, very quickly. The memory consumption is moderate and grows linearly with the number of sequences; 20,000 takes 50-55 mb. The first characters between the '>' character and the first space is recognized as the string (sequence) identifier and should be unique for each record. The number of sequences found in the input collection is returned and the file xyz.bin is written to disk.

Parameters:

pre_subst

An integer that determines how to interpret some special characters in the strings. The substitution is done before k-mers are extracted. Use 1 to eliminate all periods(.), hypens(-), and space characters(\s, which includes \s,\t,\n). This is the default behavior. Use 2 to convert same characters as above into underscores(_).

wordlen

An integer that specifies the k-mer length to index for each record in the input data file. Default is 7.

minlen

Specifies the minimum string length for a database sequence to be worthy of indexing. Default is 50.

valid_chars

Specifies the characters to allow for formatting. Entire k-mer must be composed of these characters otherwise it will be ignored. When not defined, all characters are valid. Default is undef (indicating all characters are valid)

silent

Specifies if caller wants progress messages printed to STDERR while formatting. Default is false, meaning not silent.

_check_arguments_for_formatdb

It checks that word length and minimum sequence length is reasonable. Sets silent to true by default. Sets pre_subst to 1 by default. Defines binary file name. If errors are found, program exits. Returns a hash or hash ref depending on the context of the method call.

match_oligos({query => $path})

  $sr->match_oligos({query => 'query.fasta'});
  $sr->match_oligos({ query => 'query.fasta',
                      outlen => 10,
                      minpct => 95,
                      reverse => 1,
                      outfile => '/home/donny/sr_results.txt',
                    });
  
  my $matches = $sr->match_oligos({query => 'query.fasta'});
  print Dumper($matches);
  foreach my $k (keys %{$matches} ) {
    print "matches for $k :\n";
    foreach my $hit ( @{ $matches->{$k} } ) {
      print "hit id:" . $hit->[0] . " perc:" . $hit->[1] . "\n";
    }
  }  
    

This routine quickly estimates the overall similarity between a given set of DNA or RNA sequence(s) and a background set of database sequences (usually homologues). It returns a sorted list of similarities as a table. The similarity between sequences A and B are the number of unique k-words (short subsequence) that they share, divided by the smallest total unique k-word count in either A or B. The result are scores that do not depend on sequence lengths. When called in void context, the routine prints to the given output file or STDOUT; otherwise a hash_ref is returned.

Parameters:

query

Sets the path where the query fasta file will be found. Required. In future versions, query could be a data structure instead. Need to build an abstract iterator for this feature to be enabled.

minpct

A real number indicating the the minimum percent match that should be output/returned. Default = 50.

outlen

An integer indicating the maximum number of ranked db matches that should be output/returned. Default = 100.

valid_chars

Specifies the characters to allow for kmer-searching. Entire n-mer must be composed of these characters otherwise it will be ignored. When not defined, all characters are valid. Default is undef (indicating all characters are valid)

reverse

A boolean value. If true, reverses input sequence. Default = false.

noids

A boolean value. If true, prints database index numbers instead of sequence ids. Default = false.

outfile

Sets the path of the output file (instead of sending to STDOUT). Default = false. Meaning output is sent to STDOUT.

silent

Specifies if caller wants progress messages printed to STDERR while matching. Default is false, meaning not silent.

_check_arguments_for_match_oligos

Internally used function. Validates the arguments for outlen, minpct, reverse, query, noids, silent, pre_subst and output. If errors are found, program exits. Returns a hash or hash ref depending on the context of the method call.

_create_oligos

Internal method. Creates a list of unique words of a given length from a given sequence. Enforces k-mers composed of purely valid_char if requested. Converts characters to upper case where possible. This may become an option in the future.

_nearest_index

Finds the index of a given sorted array where a given number would fit in the sorted order. The returned index can then be used to insert the number, for example. The array may contain integers and/or floats, same for the number.

_count_records

Counts the number of records in a fasta formated file.

_update_scores_C

Receives a list of indicies, score vector, and the final index Updates the score vector by incrementing the oligo match count for the correct sequences.