NAME

NCBIx::BigFetch - Retrieve very large NCBI sequence result sets based on keyword search

SYNOPSIS

  use NCBIx::BigFetch;
  
  # Parameters
  my $params = { project_id => "1", 
                 base_dir   => "/home/user/data", 
                 db         => "protein",
                 query      => "apoptosis",
                 return_max => "500" };
  
  # Start project
  my $project = NCBIx::BigFetch->new( $params );
  
  # Love the one you're with
  print " AUTHORS: " . $project->authors() . "\n";
  
  # Attempt all batches of sequences
  while ( $project->results_waiting() ) { $project->get_next_batch(); }
  
  # Get missing batches 
  while ( $project->missing_batches() ) { $project->get_missing_batch(); }
  
  # Find unavailable ids
  my $ids = $project->unavailable_ids();
  
  # Retrieve unavailable ids
  foreach my $id ( @$ids ) { $project->get_sequence( $id ); }

DESCRIPTION

NCBIx::BigFetch uses the esearch and efetch services of NCBI to retrieve sequences by keyword. It was designed for very large result sets; the first project it was used on had over 11,000,000 sequences.

Downloaded data is organized by "project id" and "base directory" and saved in text files. Each file includes the project id in its name. Besides the data files, two other files are saved: 1) the initial search result, which includes the WebEnv key, and 2) a configuration file, which saves the parsed data and is used to resume an interrupted download and to recover missing batches or sequences.

Results are retrieved in batches whose size is set by the return_max parameter (the "retmax" setting).
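
As a rough illustration of the batching, the sketch below computes the starting index of each batch from a result count and a batch size. The batch_starts() helper is hypothetical and is not part of NCBIx::BigFetch. It assumes that indices begin at 0 and advance by the batch size, which may differ from the module's internal numbering.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of NCBIx::BigFetch): list the starting
# index of every batch, assuming indices begin at 0 and advance by the
# batch size.
sub batch_starts {
    my ( $count, $return_max ) = @_;
    die "return_max must be positive\n" unless $return_max > 0;
    my @starts;
    for ( my $index = 0; $index < $count; $index += $return_max ) {
        push @starts, $index;
    }
    return \@starts;
}

my $starts = batch_starts( 1200, 500 );
print join( ", ", @$starts ), "\n";    # prints "0, 500, 1000"
```

Under these assumptions, a result set of 1,200 sequences with a batch size of 500 needs three batches, the last of which is only partly filled.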

MAIN FUNCTIONS

  • new()

      my $project = NCBIx::BigFetch->new( $params );

    The parameters hash reference should include the following minimum keys: project_id, base_dir, db, and query.

  • results_waiting()

      while ( $project->results_waiting() ) { ... }

    This method is used to determine if all of the batches have been attempted. It compares the current index to the total count, and is TRUE if the index is less than the count.

  • get_next_batch()

      $project->get_next_batch();

    Attempts to retrieve the next batch of "retmax" sequences, starting with the current index, which is updated every time a batch is downloaded. When used as in the Synopsis above, the index is both kept in memory and updated in the configuration file. If the download is interrupted and restarted, the correct index will be used and no data will be lost.

  • note_missing_batch()

      $project->note_missing_batch( $index );

    Adds the batch index to the list of missing batches.

  • missing_batches()

      while ( $project->missing_batches() ) { ... }

    This method is used to determine if any batches have been noted as "missing". It measures the "missing" list (which is stored in the configuration file) and returns TRUE when at least one batch is listed. The batches are listed by starting index, which together with the return_max setting is used to describe a batch.

  • get_missing_batch()

      $project->get_missing_batch();

    Warning: do not kill the script during this phase.

    Gets a single batch, using the first index on the "missing" list. The index is shifted off the list and then attempted, so if you interrupt the script during this phase, you may lose track of the batch.

    Recovery: edit the configuration file and add the index back to the missing list. The index will be reported to STDOUT in the status message.

  • get_batch()

      $project->get_batch( $index );

    Gets a single batch using the index parameter. This routine may be called on its own, but it is intended to only be used by get_next_batch() and get_missing_batch().

  • unavailable_ids()

      my $ids = $project->unavailable_ids();

    Notice that this method depends on a loaded (or started) project. It reads through all data files and creates a list of individual sequences that were unavailable when a batch was reported. The list is returned as a Perl array reference.

  • get_sequence()

      $project->get_sequence( $id );

    Notice that this method depends on a loaded (or started) project. It retrieves the sequence by id and saves it to a special data file which uses "0" as an index. All unavailable sequences retrieved this way are saved to this file, so it could potentially be larger than the rest.

  • clean_sequences()

      $project->clean_sequences();

    Removes non-sequence text from sequence files and optionally removes sequences with ambiguous characters.

  • authors()

      $project->authors();

    Surely you can stand a few bytes of vanity for the price of free software!

  • BUILD()

      my $project = NCBIx::BigFetch->new( $params );

    This method is *not* called directly, but rather included in the new() method thanks to Class::Std.
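
The two loops from the Synopsis can be pictured with a small in-memory simulation. This is only a sketch of the bookkeeping described above, not the module's actual code: the %state hash, its field names, and the fetch_batch() routine are hypothetical, the starting index of 0 is an assumption, and re-queueing a batch that fails again is a choice made for this example.

```perl
use strict;
use warnings;

# In-memory stand-in for the project state. The real module keeps this
# information in its configuration file; these field names are illustrative.
my %state = ( index => 0, count => 1200, return_max => 500, missing => [] );

# Hypothetical fetch routine: fails the first time the batch starting at
# index 500 is requested and succeeds for every other request.
my %failed_once;
sub fetch_batch {
    my ($index) = @_;
    return 1 if $index != 500 || $failed_once{$index}++;
    return 0;
}

# First pass: attempt every batch and note failures as missing.
while ( $state{index} < $state{count} ) {
    my $index = $state{index};
    push @{ $state{missing} }, $index unless fetch_batch($index);
    $state{index} += $state{return_max};
}

# Second pass: retry missing batches. The index leaves the list before the
# attempt, so an interruption at this point would lose track of the batch.
my @recovered;
while ( @{ $state{missing} } ) {
    my $index = shift @{ $state{missing} };
    if ( fetch_batch($index) ) {
        push @recovered, $index;
    }
    else {
        push @{ $state{missing} }, $index;    # put it back for another try
    }
}

print "recovered: @recovered\n";    # prints "recovered: 500"
```

The gap between the shift and the attempt in the second loop is the reason for the warning under get_missing_batch(). A real script would also want a retry limit so that a permanently failing batch cannot loop forever.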

PROPERTY FUNCTIONS

These get/set functions manage the module's properties.

  • get_base_dir()

      $project->get_base_dir();

    Gets the base directory for project data.

  • get_base_url()

      $project->get_base_url();

    Gets the base URL for NCBI eUtils.

  • get_clean_filename()

      $project->get_clean_filename();

    Creates the filename of the file that stores a project's "cleaned" sequences.

  • get_config_filename()

      $project->get_config_filename();

    Creates a filename for the configuration file based on the project_id.

  • get_count()

      $project->get_count();

    Returns the count of results for the query.

  • get_data_filename()

      $project->get_data_filename();

    Creates a filename for a given batch based on project_id and result index.

  • get_db()

      $project->get_db();

    Gets the eSearch database setting.

  • get_esearch_filename()

      $project->get_esearch_filename();

    Creates a filename for saving the initial search request.

  • get_index()

      $project->get_index();

    Gets the current result index. The index is advanced by the retmax amount after every attempted batch.

  • get_missing()

      $project->get_missing();

    Gets the list of missing batch indices.

  • get_project_id()

      $project->get_project_id();

    Gets the project_id for the loaded project.

  • get_query()

      $project->get_query();

    Gets the query string used for eSearch.

  • get_querykey()

      $project->get_querykey();

    Gets the querykey setting from the eSearch results.

  • get_return_max()

      $project->get_return_max();

    Gets the retmax setting used to limit the batch size.

  • get_return_type()

      $project->get_return_type();

    Gets the rettype setting used to determine the format of fetched sequences.

  • get_start_date()

      $project->get_start_date();

    Calculates the start date for the project.

  • get_start_time()

      $project->get_start_time();

    Calculates the start time for the project.

  • get_webenv()

      $project->get_webenv();

    Gets the WebEnv key returned from eSearch. It is used to build the eFetch URL for retrieving batches of sequences.

  • next_index()

      $project->next_index();

    Gets the next result index, which defines the batch id.

  • set_index()

      $project->set_index();

    Sets the result index.

  • set_missing()

      $project->set_missing();

    Sets the list of missing batches.
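
To show how several of these properties relate to one another, the sketch below assembles an eFetch URL from a base URL, database, query key, WebEnv key, result index, batch size, and return type. It uses the parameter names documented for NCBI E-utilities (db, query_key, WebEnv, retstart, retmax, rettype). The efetch_url() and uri_escape_simple() helpers are hypothetical, the WebEnv value is a placeholder, and the module may build its URL differently.

```perl
use strict;
use warnings;

# Percent-encode a value for use in a URL query string.
sub uri_escape_simple {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9\-_.~])/sprintf( "%%%02X", ord($1) )/ge;
    return $value;
}

# Hypothetical helper (not part of NCBIx::BigFetch): build an eFetch URL
# for one batch from values like those returned by the property functions.
sub efetch_url {
    my (%arg) = @_;
    my @pairs = map { $_ . "=" . uri_escape_simple( $arg{$_} ) }
        qw( db query_key WebEnv retstart retmax rettype );
    return $arg{base_url} . "efetch.fcgi?" . join( "&", @pairs );
}

my $url = efetch_url(
    base_url  => "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/",
    db        => "protein",
    query_key => 1,
    WebEnv    => "EXAMPLE_WEBENV",    # placeholder, not a real key
    retstart  => 500,                 # starting index of the batch
    retmax    => 500,                 # batch size
    rettype   => "fasta",
);
print "$url\n";
```

Because the WebEnv key and query key refer to a search stored on NCBI's history server, each batch request only needs a new retstart value rather than the full list of ids.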

EXPORT

None

SEE ALSO

http://bioinformatics.ualr.edu/

http://www.ncbi.nlm.nih.gov/entrez/query/static/efetch_help.html

http://eutils.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html

AUTHORS

Roger Hall <roger@iosea.com>

Michael Bauer <mbkodos@gmail.com>

Kamakshi Duvvuru <kduvvuru@gmail.com>

COPYRIGHT AND LICENSE

Copyleft (C) 2009 by the Authors

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.5 or, at your option, any later version of Perl 5 you may have available.