
NAME

Bio::LITE::Taxonomy::NCBI::Gi2taxid - Fast mapping of NCBI GIs to Taxids with a very low memory footprint.

SYNOPSIS

Creation of a new GI to Taxid dictionary (binary mapping file):

  use Bio::LITE::Taxonomy::NCBI::Gi2taxid qw/new_dict/;

  new_dict (in => "gi_taxid_prot.dmp",
            out => "gi_taxid_prot.bin");

Usage of the dictionary:

  use Bio::LITE::Taxonomy::NCBI::Gi2taxid;

  my $dict = Bio::LITE::Taxonomy::NCBI::Gi2taxid->new(dict=>"gi_taxid_prot.bin");
  my $taxid = $dict->get_taxid(12553);

DESCRIPTION

The NCBI site offers files that map gene and protein sequence identifiers (GIs) to their taxon of origin (Taxids). Because of the large number of sequences available, storing this mapping in a regular Perl hash is very inefficient: just building the hash requires more than 10 GB of system memory.

This is a very simple module designed to map NCBI GIs to Taxids efficiently, with speed as the primary goal. It retrieves Taxids from GIs very quickly and with low memory usage, and it is faster than retrieving the mappings from an SQL database or a local DBHash.

To achieve this, it uses a binary index that is created with the new_dict function. The index has to be created once for each mapping file.
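
As a rough illustration of why a binary index is compact, the sketch below packs Taxids into fixed-width 32-bit slots addressed by GI, so a lookup is a single offset calculation. The layout (4 bytes per GI, 0 meaning "no entry"), the helper names build_index and lookup, and the sample rows are all assumptions made for this example; they are not the module's actual file format or API.

```perl
use strict;
use warnings;

# Build a fixed-width index: slot N (4 bytes) holds the taxid for GI N.
# Assumed layout for illustration only; 0 marks a GI with no mapping.
sub build_index {
    my ($fh) = @_;                       # handle on a "gi<TAB>taxid" file
    my $index = '';
    while (my $line = <$fh>) {
        chomp $line;
        my ($gi, $taxid) = split /\t/, $line;
        next unless defined $taxid;
        vec($index, $gi, 32) = $taxid;   # the string grows as needed
    }
    return $index;
}

# Return the taxid stored for a GI, or undef when the slot is empty.
sub lookup {
    my ($index_ref, $gi) = @_;
    my $taxid = vec($$index_ref, $gi, 32);   # reads past the end yield 0
    return $taxid || undef;
}

# Made-up sample rows in the two-column mapping format.
my $data = "2\t9913\n5\t9606\n7\t562\n";
open my $in, '<', \$data or die "Cannot open in-memory file: $!";
my $index = build_index($in);
close $in;

printf "GI 5 -> %s\n", lookup(\$index, 5);
printf "index size: %d bytes\n", length($index);
```

With this layout the index size depends only on the largest GI, not on per-key hash overhead, which is where the memory saving over a regular hash would come from.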

The original mapping files can be downloaded from the NCBI site at the following address: ftp://ftp.ncbi.nih.gov/pub/taxonomy/.

FUNCTIONS

new_dict

This function creates a new binary dictionary from the NCBI mapping file. The file must be uncompressed before it is passed to the function.

*NOTE* Since version 0.05, the library uses a more compact binary file format. Binary files created with earlier versions will not work with this version, and vice versa, so the binary dictionary has to be recreated with this version.

The function accepts the following parameters:

in

This is the uncompressed mapping file from the NCBI. The function accepts a filename or a filehandle.

out

Optional. Where the binary dictionary is written. The function accepts a filename or a filehandle (which must be opened for writing). If absent, STDOUT is assumed.

chunk_size

Optional. During the binary conversion, the library keeps chunks of data in memory to speed up the process. This number specifies the size of the chunks. By default 30Mb is used. The whole chunk is stored in a Perl scalar, so be careful not to exceed the capacity of a scalar.
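
The buffering idea can be sketched as follows: packed records accumulate in a scalar and are flushed to the output handle once the buffer reaches a size limit. The function name write_chunked, the byte-based limit and the 32-bit record layout are assumptions for illustration; the module's internals may differ.

```perl
use strict;
use warnings;

# Write packed taxids to $out, buffering up to $chunk_size bytes in memory.
# Returns the number of flushes performed. Hypothetical helper, not module API.
sub write_chunked {
    my ($out, $taxids, $chunk_size) = @_;
    my ($buffer, $flushes) = ('', 0);
    for my $taxid (@$taxids) {
        $buffer .= pack 'N', $taxid;           # 4 bytes, big-endian
        if (length($buffer) >= $chunk_size) {
            print {$out} $buffer or die "Write failed: $!";
            $buffer = '';
            $flushes++;
        }
    }
    if (length $buffer) {                       # flush the remainder
        print {$out} $buffer or die "Write failed: $!";
        $flushes++;
    }
    return $flushes;
}

# Tiny demonstration with an in-memory output and an 8-byte chunk size.
my $binary = '';
open my $out, '>', \$binary or die "Cannot open in-memory file: $!";
binmode $out;
my $flushes = write_chunked($out, [9913, 9606, 562, 10090, 7227], 8);
close $out;

printf "%d bytes written in %d flushes\n", length($binary), $flushes;
```

A larger chunk means fewer write calls at the cost of a bigger temporary scalar, which is the trade-off the chunk_size parameter exposes.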

CONSTRUCTOR

new

Once the binary dictionary has been created, it can be used as an object through this constructor. It accepts the following parameters:

dict

This is the binary dictionary obtained with the new_dict function. The name of the file or a filehandle is accepted.

save_mem

Optional. Use this option to avoid loading the binary dictionary into memory. This saves almost 1 GB of system memory, but Taxid lookups will be ~20% slower. This option is off by default.
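
The trade-off behind save_mem can be pictured with two lookup strategies over the same fixed-width file: reading the whole file into a scalar once, or seeking to the record on disk for every query. The record layout, the toy data and the names lookup_in_memory and lookup_on_disk are assumptions for this sketch, not the module's implementation.

```perl
use strict;
use warnings;
use File::Temp qw/tempfile/;

# Toy index: slot N holds the taxid for GI N as a 32-bit big-endian integer.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} pack('N*', 0, 0, 9913, 0, 0, 9606);
close $fh or die "Cannot close $path: $!";

# Strategy 1: slurp the file once; each lookup is a substr plus unpack.
sub lookup_in_memory {
    my ($index_ref, $gi) = @_;
    return if ($gi + 1) * 4 > length($$index_ref);
    return unpack 'N', substr($$index_ref, $gi * 4, 4);
}

# Strategy 2: keep only a filehandle; each lookup costs a seek and a read.
sub lookup_on_disk {
    my ($disk_fh, $gi) = @_;
    my $record;
    seek($disk_fh, $gi * 4, 0) or return;
    my $got = read($disk_fh, $record, 4);
    return unless defined $got && $got == 4;
    return unpack 'N', $record;
}

open my $disk, '<:raw', $path or die "Cannot open $path: $!";
my $index = do { local $/; <$disk> };    # whole index held in memory

print "in memory: ", lookup_in_memory(\$index, 5), "\n";
print "on disk:   ", lookup_on_disk($disk, 5), "\n";
close $disk;
```

The in-memory variant pays the full file size in RAM up front, while the on-disk variant holds almost nothing but adds I/O to every lookup.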

METHODS

get_taxid

This method receives a GI and returns the corresponding Taxid.

SEE ALSO

DBHash, Bio::DB::Taxonomy

AUTHOR

Miguel Pignatelli

Any comments should be addressed to emepyc@gmail.com

LICENSE

Copyright 2013 Miguel Pignatelli, all rights reserved.

This library is free software; you may redistribute it and/or modify it under the same terms as Perl itself.