NAME

Mashtree::Db - functions for Mashtree databasing

SYNOPSIS

  use strict;
  use warnings;
  use Mashtree::Db;

  my $dbFile = "mashtree.tsv";
  my $db=Mashtree::Db->new($dbFile);

  # Add 10 distances from genome "test" to other genomes
  my %distHash;
  for(my $dist=0;$dist<10;$dist++){
    my $otherGenome = "genome" . $dist;
    $distHash{"test"}{$otherGenome} = $dist;
  }
  $db->addDistancesFromHash(\%distHash);

  my $firstDistance = $db->findDistance("test", "genome0");
  # => 0

DESCRIPTION

This is a helper module and is usually not used directly. Mashtree uses it to read from and write to its internal database.

METHODS

Mashtree::Db->new($dbFile, \%settings)

Create a new Mashtree::Db object.

The database file is a tab-separated file and will be created if it doesn't exist. If it does exist, then it will be read into memory.

Arguments:

  * $dbFile - a file path
  * $settings - a reference to a hash of key/values (currently unused)

$db->selectDb

Selects a database. If it doesn't exist, then it will be created. Then, it sets the object property `dbFile` to the file path.
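The create-if-missing behavior described above can be sketched in plain Perl. This is a minimal illustration and not the module's own code: `ensure_db_file` is a hypothetical helper, and the temporary directory is used only so the example can run anywhere.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper, not part of Mashtree::Db: create an empty file at
# $path if nothing exists there yet, then return the path.
sub ensure_db_file {
  my ($path) = @_;
  if (!-e $path) {
    open(my $fh, '>', $path) or die "ERROR: could not create $path: $!";
    close($fh);
  }
  return $path;
}

# Work in a throwaway directory that is removed when the script exits.
my $dir    = tempdir(CLEANUP => 1);
my $dbFile = ensure_db_file(File::Spec->catfile($dir, "mashtree.tsv"));
print "database file exists\n" if -e $dbFile;
```

Calling the helper a second time on the same path leaves the existing file untouched, which mirrors the "select, or create if absent" behavior described for `selectDb`.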

$db->readDatabase

Reads the database from the dbFile set by `selectDb`. Returns a hash of distances, e.g., genome1 => {genome2=>dist}

Then, this hash of distances is set in the object property `cache`.

$db->addDistancesFromHash

Add distances from a Perl hash, $distHash.

$distHash is a hash reference of the form { genome1 => { genome2 => dist } }.
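To make the nested structure concrete, here is a minimal sketch of merging one such hash into another. `merge_distances` is a hypothetical helper written for illustration; how Mashtree::Db stores distances internally, and whether it also records the reverse pair, is not stated here.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Mashtree::Db: copy every distance from
# $distHash into $cache, overwriting any existing value for the same pair.
sub merge_distances {
  my ($cache, $distHash) = @_;
  for my $query (keys %$distHash) {
    for my $subject (keys %{ $distHash->{$query} }) {
      $cache->{$query}{$subject} = $distHash->{$query}{$subject};
    }
  }
  return $cache;
}

my %cache    = (test => { genome0 => 0 });
my %distHash = (test => { genome1 => 1 }, other => { genome0 => 0.5 });

# Both arguments are hash references, so pass \%hash rather than the hash itself.
merge_distances(\%cache, \%distHash);
print join(",", sort keys %{ $cache{test} }), "\n";
```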

$db->addDistances

Add distances from a TSV file. The TSV file should be a mash distances TSV file in the following format, e.g.:

  # query t/lambda/sample1.fastq.gz
  t/lambda/sample2.fastq.gz 0.059
  t/lambda/sample3.fastq.gz 0.061
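The following sketch shows one way such a file could be parsed into the nested hash used elsewhere in this module. It is an illustration under assumptions and not the module's parser: `parse_mash_distances` is a hypothetical name, the `# query` line is assumed to name the query genome, and each following line is assumed to hold a subject genome and a distance separated by a tab.

```perl
use strict;
use warnings;

# Hypothetical parser, not part of Mashtree::Db.
# Returns a hash reference shaped like { query => { subject => distance } }.
sub parse_mash_distances {
  my ($text) = @_;
  my (%dist, $query);
  for my $line (split /\n/, $text) {
    next if $line =~ /^\s*$/;                 # skip blank lines
    if ($line =~ /^#\s*query\s+(\S+)/) {      # header line names the query
      $query = $1;
      next;
    }
    die "distance line before '# query' header: $line\n"
      unless defined $query;
    my ($subject, $distance) = split /\t/, $line;
    $dist{$query}{$subject} = $distance;
  }
  return \%dist;
}

my $tsv = join "\n",
  "# query t/lambda/sample1.fastq.gz",
  "t/lambda/sample2.fastq.gz\t0.059",
  "t/lambda/sample3.fastq.gz\t0.061";

my $parsed = parse_mash_distances($tsv);
print $parsed->{"t/lambda/sample1.fastq.gz"}{"t/lambda/sample2.fastq.gz"}, "\n";
```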

$db->findDistance

Find the distance between any two genomes. Return undef if not found.

$db->findDistances

Find all distances from one genome to all others. Return undef if not found.
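The "undef if not found" contract of both lookups can be sketched against a plain nested hash. `find_distance` and `find_distances` below are hypothetical stand-ins, not the module's methods, and they look a pair up only in the order given; whether Mashtree::Db also checks the reverse order is not stated here.

```perl
use strict;
use warnings;

my %cache = (test => { genome0 => 0, genome1 => 1 });

# Hypothetical stand-in: all distances from one genome, or undef.
sub find_distances {
  my ($cache, $genome) = @_;
  return exists $cache->{$genome} ? $cache->{$genome} : undef;
}

# Hypothetical stand-in: one distance, or undef.
sub find_distance {
  my ($cache, $genome1, $genome2) = @_;
  # Test the outer key first so a miss does not autovivify $cache->{$genome1}.
  return undef unless exists $cache->{$genome1};
  return $cache->{$genome1}{$genome2};  # undef when the inner key is absent
}

my $hit  = find_distance(\%cache, "test", "genome1");
my $miss = find_distance(\%cache, "unknown", "genome0");
print defined $miss ? "found\n" : "not found\n";
```

Because a distance of 0 is a valid result, callers should test the return value with `defined` rather than for truth.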

$db->toString

Turn the database into a string representation.

Arguments:

  * genomeArray - list of genomes to include, or undef for all genomes
  * format - can be a string of one of these values:
    * tsv    3-column format (default)
    * matrix all-vs all tsv format
    * phylip Phylip matrix format
  * sortBy - can be:
    * abc (default)
    * rand
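As an illustration of the default options (tsv format, abc sorting), the sketch below renders a nested distance hash as three tab-separated columns. It is not the module's implementation: `to_tsv` is a hypothetical function, and the column order (genome1, genome2, distance) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical renderer, not part of Mashtree::Db.
# $genomes is an optional array reference; undef means all genomes.
sub to_tsv {
  my ($dist, $genomes) = @_;
  my @genomes = sort { $a cmp $b } ($genomes ? @$genomes : keys %$dist);
  my $out = "";
  for my $g1 (@genomes) {
    my $row = $dist->{$g1} or next;           # skip genomes with no distances
    for my $g2 (sort { $a cmp $b } keys %$row) {
      $out .= join("\t", $g1, $g2, $row->{$g2}) . "\n";
    }
  }
  return $out;
}

my %dist = (test => { genome1 => 1, genome0 => 0 });
my $tsvString = to_tsv(\%dist);
print $tsvString;
```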