NAME

NCBIx::BigFetch - Retrieve very large NCBI sequence result sets based on keyword search

SYNOPSIS

use NCBIx::BigFetch;

# Parameters
my $params = { project_id => "1",
               base_dir   => "/home/user/data",
               db         => "protein",
               query      => "apoptosis",
               return_max => "500" };

# Start project
my $project = NCBIx::BigFetch->new( $params );

# Love the one you're with
print " AUTHORS: " . $project->authors() . "\n";

# Attempt all batches of sequences
while ( $project->results_waiting() ) { $project->get_next_batch(); }

# Get missing batches 
while ( $project->missing_batches() ) { $project->get_missing_batch(); }

# Find unavailable ids
my $ids = $project->unavailable_ids();

# Retrieve unavailable ids
foreach my $id ( @$ids ) { $project->get_sequence( $id ); }

DESCRIPTION

NCBIx::BigFetch uses the esearch and efetch services of NCBI to retrieve sequences by keyword. It was designed for very large result sets; the first project it was used on had over 11,000,000 sequences.

Downloaded data is organized by "project id" and "base directory" and saved in text files. Each file includes the project id in its name. Besides the data files, two other files are saved: 1) the initial search result, which includes the WebEnv key, and 2) a configuration file, which stores the parsed data and is used to resume an interrupted download and to recover missing batches or sequences.

Results are retrieved in batches. The batch size is set by the return_max parameter, which corresponds to the "retmax" value of the NCBI E-utilities.
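
The batch arithmetic can be sketched in a few lines of plain Perl. This is an illustration only, not the module's code: batch_starts() is a hypothetical helper, and the default first index of 1 is an assumption (chosen because index "0" is reserved for individually retrieved sequences, see get_sequence()).

```perl
use strict;
use warnings;

# Hypothetical helper: list the starting index of every batch needed
# to cover $count results in batches of $return_max.
# $first is the index of the first result (assumed to be 1 by default).
sub batch_starts {
    my ( $count, $return_max, $first ) = @_;
    $first = 1 unless defined $first;
    die "return_max must be positive\n" unless $return_max > 0;

    my @starts;
    for ( my $i = $first; $i < $first + $count; $i += $return_max ) {
        push @starts, $i;
    }
    return @starts;
}

# 1200 results in batches of 500: two full batches and one of 200.
my @starts = batch_starts( 1200, 500 );
print "Batches start at: @starts\n";
```

Each starting index, together with return_max, identifies one batch, which is the same pair of values the "missing" list relies on.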

FUNCTIONS

  • new()

    my $project = NCBIx::BigFetch->new( $params );

    The parameters hash reference must include at least the following keys: project_id, base_dir, db, and query.

  • results_waiting()

    while ( $project->results_waiting() ) { ... }

    This method is used to determine if all of the batches have been attempted. It compares the current index to the total count, and is TRUE if the index is less than the count.

  • get_next_batch()

    $project->get_next_batch();

    Attempts to retrieve the next batch of up to return_max sequences, starting with the current index, which is updated every time a batch is downloaded. When used as in the Synopsis above, the index is both kept in memory and updated in the configuration file. If the download is interrupted and restarted, the correct index will be used and no data will be lost.

  • missing_batches()

    while ( $project->missing_batches() ) { ... }

    This method is used to determine if any batches have been noted as "missing". It checks the length of the "missing" list (which is stored in the configuration file) and returns TRUE when at least one batch is listed. The batches are listed by starting index, which together with the return_max setting identifies a batch.

  • get_missing_batch()

    $project->get_missing_batch();

    Warning: do not kill the script during this phase.

    Gets a single batch, using the first index on the "missing" list. The index is shifted off the list before the download is attempted, so if you interrupt the script during this phase you may lose track of the batch.

    Recovery: edit the configuration file and add the index back to the missing list. The index will be reported to STDOUT in the status message.

  • unavailable_ids()

    my $ids = $project->unavailable_ids();

    Notice that this method depends on a loaded (or started) project. It reads through all data files and creates a list of individual sequences that were unavailable when a batch was reported. The list is returned as a Perl array reference.

  • get_sequence()

    $project->get_sequence( $id );

    Notice that this method depends on a loaded (or started) project. It retrieves the sequence by id and saves it to a special data file which uses "0" as an index. All unavailable sequences retrieved this way are saved to this file, so it could potentially be larger than the rest.

  • authors()

    $project->authors();

    Surely you can stand a few bytes of vanity for the price of free software!
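
The warning under get_missing_batch() follows from the order of operations it describes: the index leaves the "missing" list before the download is attempted. The sketch below imitates that order in plain Perl so the window of risk is visible. It is not the module's implementation; the @missing contents, attempt_batch() and retry_first_missing() are hypothetical stand-ins.

```perl
use strict;
use warnings;

# Hypothetical "missing" list of batch starting indices, as it might
# look after being read from the configuration file.
my @missing = ( 501, 2001 );

# Hypothetical stand-in for the download step; always succeeds here.
sub attempt_batch {
    my ($start) = @_;
    return 1;
}

# Imitates the documented order: remove the index, then attempt it.
sub retry_first_missing {
    my ($list) = @_;
    my $start = shift @$list;                   # index is off the list now
    print "Retrying batch at index $start\n";   # note this value for recovery
    attempt_batch($start);                      # an interrupt here loses $start
    return $start;
}

my $attempted = retry_first_missing( \@missing );
print "Still missing: @missing\n";

# Manual recovery, equivalent to editing the configuration file:
# put the reported index back on the list.
unshift @missing, $attempted;
print "After recovery: @missing\n";
```

If the script is killed between the shift and the end of the download, the only record of the index is the status message, which is why the recovery step asks you to copy it back into the configuration file.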

EXPORT

None

SEE ALSO

http://bioinformatics.ualr.edu/

AUTHOR

Roger Hall <roger@iosea.com>
Michael Bauer <mbkodos@gmail.com>
Kamakshi Duvvuru <kduvvuru@gmail.com>

COPYRIGHT AND LICENSE

Copyleft (C) 2009 by the Authors

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.5 or, at your option, any later version of Perl 5 you may have available.