Robert Buels

NAME

Bio::BLAST::Database - work with formatted BLAST databases

SYNOPSIS

  use Bio::BLAST::Database;

  # open an existing bdb for reading
  my $fs = Bio::BLAST::Database->open(
               full_file_basename => '/path/to/my_bdb',
             );
  # will read from /path/to/my_bdb.nin, /path/to/my_bdb.nsq, etc.

  my @filenames = $fs->list_files;

  # reopen it for writing
  $fs = Bio::BLAST::Database->open(
            full_file_basename => '/path/to/my_bdb',
            write => 1,
          );

  # replace it with a different set of sequences
  $fs->format_from_file( seqfile => 'myseqs.seq' );

  # can also get some metadata about it
  print "db's title is ".$fs->title."\n";
  print "db was last formatted on ".localtime( $fs->format_time )."\n";
  print "db file modification was ".localtime( $fs->file_modtime )."\n";

DESCRIPTION

Each object of this class represents an NCBI-formatted sequence database on disk. Such a database is a set of files whose exact structure varies somewhat with the type and size of the sequence set.

This class is mostly an object-oriented wrapper around NCBI's fastacmd and formatdb tools.

ATTRIBUTES

full_file_basename

Full path to the BLAST database files, minus the final suffixes (.nin, .nsq, etc.).

   my $basename = $db->full_file_basename;
   # returns '/data/shared/blast/databases/genbank/nr'

create_dirs

True/false flag for whether to create any necessary directories at format time.

write

True/false flag for whether to overwrite any files that are in the way when the database is formatted.

title

Title of this BLAST database, if set.

indexed_seqs

Returns whether this BLAST database is indexed.

type

Accessor for the type of this BLAST database ('nucleotide' or 'protein'). It must be set when calling new(), but open() inspects the existing files and sets it for you.

METHODS

open

  Usage: my $fs = Bio::BLAST::Database->open({
                      full_file_basename => $ffbn,
                      write => 1,
                      create_dirs => 1,
                   });
  Desc : open a BLAST database with the given full file basename (ffbn).
  Args : hashref of params as:
         {  full_file_basename => full path plus basename of files in this blastdb,
            type => 'nucleotide' or 'protein',
            write => default false, set true to overwrite any files in the way,
            create_dirs => default false, set true to create any necessary directories
                           if formatted
         }
  Ret  : Bio::BLAST::Database object
  Side Effects: none if no files are present at the given ffbn.  Otherwise,
                dies if files are present and write is not specified,
                or if dir does not exist and create_dirs was not specified
  Example:

to_fasta

  Usage: my $fasta_fh = $bdb->to_fasta;
  Desc : get the contents of this blast database in FASTA format
  Ret  : an IO::Pipe filehandle
  Args : none
  Side Effects: runs 'fastacmd' in a forked process, cleaning up its output,
                and passing it to you
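
The handle returned by to_fasta can be read line by line like any other read handle. The following is a minimal sketch of that; count_fasta_records is a hypothetical helper introduced here for illustration and is not part of Bio::BLAST::Database.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::BLAST::Database).
# Counts FASTA records on any readable filehandle by counting
# the lines that start with '>', the FASTA definition-line marker.
sub count_fasta_records {
    my ($fh) = @_;
    my $count = 0;
    while ( my $line = <$fh> ) {
        $count++ if $line =~ /^>/;
    }
    return $count;
}
```

With a real database object this would be called as count_fasta_records( $bdb->to_fasta ). Reading the handle as a stream avoids holding the whole database in memory.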

format_from_file

  Usage: $db->format_from_file(seqfile => 'mysequences.seq');
  Desc : format this blast database from the given source file,
         into its proper place on disk, overwriting the files already
         present
  Ret  : nothing meaningful
  Args : hash-style list as:
          seqfile => filename containing sequences,
          title   => (optional) title for this blast database,
          indexed_seqs => (optional) if true, formats the database with
                          indexing (and sets indexed_seqs in this obj)
  Side Effects: runs 'formatdb' to format the given sequences,
                dies on failure

file_modtime

  Usage: my $mtime = $db->file_modtime;
  Desc : get the earliest unix modification time of the database files
  Args : none
  Ret  : earliest unix modification time of the database files
  Side Effects:
  Example:
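
To illustrate what "earliest modification time" means for a set of files, here is a rough sketch using only core Perl. earliest_mtime is a hypothetical helper, not the module's implementation, and it assumes the file list is supplied by the caller (for instance from list_files).

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper (not part of Bio::BLAST::Database).
# Returns the smallest mtime among the given files, or undef if
# none of them can be stat()ed.
sub earliest_mtime {
    my @files = @_;
    # stat returns an empty list for a missing file, so such files
    # simply contribute nothing to @mtimes
    my @mtimes = grep { defined } map { ( stat $_ )[9] } @files;
    return @mtimes ? min(@mtimes) : undef;
}
```

Taking the earliest time rather than the latest is the conservative choice when deciding whether a database is older than its source data.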

format_time

  Usage: my $time = $db->format_time;
  Desc : get the format time of these db files
  Ret  : the value time() would have returned when
         this database was last formatted, or undef
         if that could not be determined (like if the
         files aren't there)
  Args : none
  Side Effects: runs 'fastacmd' to extract the formatting
                time from the database files

  NOTE:  This function assumes that the computer that
         last formatted this database had the same time zone
         set as the computer we are running on.
         Also, the time returned by this function is rounded
         down to the minute, because fastacmd does not print
         the format time in seconds.
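
Because the returned time is rounded down to the minute and may be undef, comparisons against a source file's modification time need some care. The sketch below shows one conservative way to decide whether a re-format is needed; needs_reformat is a hypothetical helper, not part of this module, and it deliberately treats "modified in the same minute" as stale.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::BLAST::Database).
# $format_time  : value from $db->format_time (may be undef)
# $source_mtime : mtime of the source sequence file, in epoch seconds
# Returns 1 if the database should be re-formatted, 0 otherwise.
sub needs_reformat {
    my ( $format_time, $source_mtime ) = @_;

    # format time could not be determined: re-format to be safe
    return 1 unless defined $format_time;

    # format_time is rounded down to the minute, so a source file
    # touched in that same minute is treated as newer
    return $source_mtime >= $format_time ? 1 : 0;
}
```

A caller might use it as needs_reformat( $db->format_time, (stat $seqfile)[9] ). The time zone caveat in the note above still applies to the value passed in.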

check_format_permissions

  Usage: my $err = $bdb->check_format_permissions;
         die "cannot format: $err\n" if $err;
  Desc : check directory existence and file permissions to see if a
         format_from_file() is likely to succeed.  This is useful,
         for example, when you have a script that downloads some
         remote database and you'd like to check first whether
         you even have permissions to format before you take the
         time to download something.
  Args : (optional) alternate full file basename to write blast DB to
           e.g. '/tmp/mytempdir/tester_blast_db'
  Ret  : nothing if everything looks good,
         otherwise a string error message summarizing the reason
         for failure
  Side Effects: reads from filesystem, may stat some files
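
The download scenario described above can be sketched as follows. fetch_and_format is a hypothetical helper, and the $fetch callback stands in for whatever download code a script would use; only check_format_permissions and format_from_file come from this module.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::BLAST::Database).
# $db    : an object providing check_format_permissions and format_from_file
# $fetch : code ref that downloads the sequences and returns the local filename
sub fetch_and_format {
    my ( $db, $fetch ) = @_;

    # a true return value is an error message, so bail out
    # before spending time on the download
    if ( my $err = $db->check_format_permissions ) {
        die "cannot format: $err\n";
    }

    my $seqfile = $fetch->();
    $db->format_from_file( seqfile => $seqfile );
    return $seqfile;
}
```

Note that the check only reports whether formatting is likely to succeed; format_from_file can still die, so callers that need to recover should wrap the call in eval.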

is_split

  Usage: print "that thing is split, yo" if $db->is_split;
  Desc : determine whether this database is in multiple parts
  Ret  : true if this database has been split into multiple
         files by formatdb (e.g. nr.00.pin, nr.01.pin, etc.)
  Args : none
  Side Effects: looks in filesystem
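
As a rough illustration of the naming pattern mentioned above, the sketch below looks for volume-numbered index files in a list of filenames. looks_split is a hypothetical helper, not how is_split is implemented, and the two-digit volume pattern is an assumption taken from the example names (nr.00.pin, nr.01.pin).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::BLAST::Database).
# Returns the number of volume index files (e.g. nr.00.pin, nr.01.nin)
# found in the given list; 0 means the names do not look split.
# Assumes two-digit volume numbers, as in the example names above.
sub looks_split {
    my @files = @_;
    return scalar grep { /\.\d{2}\.[np]in\z/ } @files;
}
```

In real code, prefer $db->is_split, which inspects the filesystem itself.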

files_are_complete

  Usage: print "complete!" if $db->files_are_complete;
  Desc : tell whether this blast db has a complete set of files on disk
  Ret  : true if the set of files on disk looks complete,
         false if not
  Args : (optional) true value if the files should only be
         considered complete if the sequences are indexed for retrieval
  Side Effects: lists files on disk

list_files

  Usage: my @files = $db->list_files;
  Desc : get the list of files that belong to this blast database
  Ret  : list of full paths to all files belonging to this blast database
  Args : none
  Side Effects: looks in the filesystem

sequences_count

  Usage: my $count = $db->sequences_count;
  Desc : get the number of sequences in this blast database
  Args : none
  Ret  : number of distinct sequences in this blast database, or undef
         if it could not be determined due to an error
  Side Effects: runs 'fastacmd' to get stats on the blast database file

get_sequence

  Usage: my $seq = $fs->get_sequence('LE_HBa0001A02');
  Desc : get a particular sequence from this db
  Args : sequence name to retrieve
  Ret  : Bio::PrimarySeqI-implementing object, or nothing if not found
  Side Effects: dies on error
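
Since get_sequence returns nothing when the sequence is not found, callers should check the result before using it. A minimal sketch, assuming the returned object supports the standard Bio::PrimarySeqI methods display_id and length; describe_sequence is a hypothetical helper, not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::BLAST::Database).
# Returns a one-line summary of the named sequence, or nothing
# if the database does not contain it.
sub describe_sequence {
    my ( $db, $id ) = @_;

    my $seq = $db->get_sequence($id)
        or return;    # not found

    return sprintf '%s (%d residues)', $seq->display_id, $seq->length;
}
```

Because get_sequence dies on error, wrap the call in eval if a failed lookup should not end the program.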

BASE CLASS(ES)

Class::Accessor::Fast

AUTHOR

Robert Buels <rmb32@cornell.edu>

COPYRIGHT AND LICENSE

This software is copyright (c) 2011 by Robert Buels.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.