The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Bio::Minimizer - minimizer package

Based on the ideas put forth by Roberts et al 2004: https://academic.oup.com/bioinformatics/article/20/18/3363/202143

SYNOPSIS

    my $minimizer = Bio::Minimizer->new($sequenceString);
    my $kmers     = $minimizer->{kmers};     # hash of minimizer => kmer
    my $minimizers= $minimizer->{minimizers};# hash of minimizer => [kmer1,kmer2,...]

    # hash of minimizer => [start1,start2,...] 
    # Start coordinates are on the fwd strand even when
    # matched against the rev strand.
    my $starts    = $minimizer->{starts}; 

    # With more options
    my $minimizer2= Bio::Minimizer->new($sequenceString,{k=>31,l=>21});

DESCRIPTION

Creates a set of minimizers from sequence

EXAMPLES

example: Sort a fastq file by minimizer, potentially shrinking gzip size.

This is implemented in this package's scripts/sort*.pl scripts.

    use Bio::Minimizer

    # Read fastq file via stdin, in this example
    while(my $id = <>){
      # Grab an entry
      ($seq,$plus,$qual) = (scalar(<>), scalar(<>), scalar(<>)); 
      chomp($id,$seq,$plus,$qual); 

      # minimizer object
      $MINIMIZER = Bio::Minimizer->new($seq,{k=>length($seq)}); 
      # The only minimizer in this entry because k==length(seq)
      $minMinimizer = (values(%{$$MINIMIZER{minimizers}}))[0]; 

      # combine the minimum minimizer with the entry, for
      # sorting later.
      # Save the entry as a string so that we don't have to
      # parse it later.
      my $entry = [$minMinimizer, "$id\n$seq\n$plus\n$qual\n"];
      push(@entry,$entry);
    }
    
    for my $e(sort {$$a[0] cmp $$b[0]} @entry){
      print $$e[1];
    } 

METHODS

Bio::Minimizer->new()
    Arguments:
      sequence     A string of ACGT
      settings     A hash
        k          Kmer length
        l          Minimizer length (some might call it lmer)
        numcpus    Number of threads to use. (not used)