
NAME

NCBIx::BigFetch - Robustly retrieve very large NCBI sequence result sets based on keyword searches using NCBI eUtils.

SYNOPSIS

  use NCBIx::BigFetch;
  
  # Parameters
  my $params = { project_id => "1", 
                 base_dir   => "/home/user/data", 
                 db         => "protein",
                 query      => "apoptosis",
                 return_max => "500" };
  
  # Start project
  my $project = NCBIx::BigFetch->new( $params );
  
  # Love the one you're with
  print " AUTHORS: " . $project->authors() . "\n";
  
  # Attempt all batches of sequences
  while ( $project->results_waiting() ) { $project->get_next_batch(); }
  
  # Get missing batches 
  while ( $project->missing_batches() ) { $project->get_missing_batch(); }
  
  # Find unavailable ids
  my $ids = $project->unavailable_ids();
  
  # Retrieve unavailable ids
  foreach my $id ( @$ids ) { $project->get_sequence( $id ); }

DESCRIPTION

NCBIx::BigFetch is useful for downloading very large result sets of sequences from NCBI given a text query. Its first use retrieved over 11,000,000 sequences returned by a single keyword search. It uses YAML to maintain project state in a configuration file, so that if network or server issues interrupt execution, the download may be easily restarted after the last completed batch.

Downloaded data is organized by "project id" and "base directory" and saved in text files. Each file includes the project id in its name. The project_id and base_dir keys are the only required keys, although you will get the same search for "apoptosis" every time unless you also set the "query" key. In any case, once a project is started, only those two parameters are needed to reload it.

Besides the data files, two other files are saved: 1) the initial search result, which includes the WebEnv key, and 2) a configuration file, which saves the parsed data and is used to resume the download and recover missing batches or sequences.
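As a rough sketch (the exact key names are whatever the module's YAML dump produces, so treat these names as assumptions), the configuration file might look like:

  ---
  project_id: 1
  db: protein
  query: apoptosis
  return_max: 500
  index: 3501        # first sequence of the next batch to attempt
  count: 11000000    # total results reported by eSearch
  webenv: NCID_EXAMPLE
  missing:           # starting indices of failed batches
    - 1501
    - 2501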

Results are retrieved in batches of "return_max" sequences. By default, the "index" starts at 1, and downloads continue until the index exceeds "count".

Occasionally errors happen and entire batches are not downloaded. In this case, the batch's starting "index" is added to the "missing" list, which is saved in the configuration file. Missing batches should be retried daily rather than left until the end of the complete run.

Working scripts are included in the script directory:

        fetch-all.pp
        fetch-missing.pp
        fetch-unavailable.pp

The recommended workflow is:

        1. Copy the scripts and edit them for a specific project. Use 
           a new number as the project ID. 

        2. Begin downloading by running fetch-all.pp, which will first 
           submit a query and save the resulting WebEnv key in a project 
           specific configuration file (using YAML).

        3. The next morning, kill the fetch-all.pp process and run 
           fetch-missing.pp until it completes.  

        4. Restart fetch-all.pp.  

If you wish to re-download "not available" sequences, you may run fetch-unavailable.pp. However, they will be downloaded at the end of fetch-all.pp if it completes normally.

If your query result set is so large that your WebEnv times out, simply start a new project with the last index of the previous project, and it will pick up the result set from there (with a new WebEnv). (A planned upgrade will automagically start another search.)

Warning: You may lose a (very) few sequences if your download extends across multiple projects. However, our testing shows that the batches generated with the same query within a few days of each other are largely identical.
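As a sketch of that restart (the 120001 index and project id 2 are hypothetical; "index" is simply the overridable public property described under PUBLIC PROPERTIES), a follow-on project might be configured like this:

```perl
use strict;
use warnings;

# Hypothetical restart: project 2 continues where project 1 left off.
# The index value (120001) stands in for the last index of project 1.
my $params = {
    project_id => "2",
    base_dir   => "/home/user/data",
    db         => "protein",
    query      => "apoptosis",       # same query as the first project
    return_max => "500",
    index      => "120001",          # last index of the previous project
};

# With NCBIx::BigFetch installed, the run itself is unchanged:
#   my $project = NCBIx::BigFetch->new( $params );
#   while ( $project->results_waiting() ) { $project->get_next_batch(); }

print "resuming at index $params->{index}\n";
```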

MAIN METHODS

These are the primary methods that implement the module's core functionality. They are the ones used in the included scripts.

  • new()

      my $project = NCBIx::BigFetch->new( $params );

    The parameters hash reference should include the following minimum keys: project_id, base_dir, db, and query.

  • results_waiting()

      while ( $project->results_waiting() ) { ... }

    This method is used to determine if all of the batches have been attempted. It compares the current index to the total count, and is TRUE if the index is less than the count.

  • get_next_batch()

      $project->get_next_batch();

    Attempts to retrieve the next batch of "return_max" sequences, starting at the current index, which is updated every time a batch is downloaded. When used as in the SYNOPSIS above, the index is both kept in memory and updated in the configuration file. If the download is interrupted and restarted, the correct index will be used and no data will be lost.

  • note_missing_batch()

      $project->note_missing_batch( $index );

    Adds the batch index to the list of missing batches.

  • missing_batches()

      while ( $project->missing_batches() ) { ... }

    This method is used to determine if any batches have been noted as "missing". It checks the "missing" list (which is stored in the configuration file) and returns TRUE when at least one batch is listed. The batches are listed by starting index, which together with the return_max setting describes a batch.

  • get_missing_batch()

      $project->get_missing_batch();

    Warning: do not kill the script during this phase.

    Gets a single batch, using the first index on the "missing" list. The index is shifted off the list before the batch is attempted, so if you kill the script during this phase you may lose track of the batch.

    Recovery: edit the configuration file and add the index back to the missing list. The index will be reported to STDOUT in the status message.
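    For example, if the status message reported index 2501, recovery means editing the missing list in the configuration file (key name assumed) back to:

      missing:
        - 2501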

  • get_batch()

      $project->get_batch( $index );

    Gets a single batch using the index parameter. This routine may be called on its own, but it is intended to only be used by get_next_batch() and get_missing_batch().

  • unavailable_ids()

      my $ids = $project->unavailable_ids();

    Notice that this method depends on a loaded (or started) project. It reads through all data files and creates a list of individual sequences that were reported as unavailable within a batch. The list is returned as a Perl array reference.

  • authors()

      $project->authors();

    Surely you can stand a few bytes of vanity for the price of free software! Actually, the email addresses are of the "lifetime" sort, so feel free to contact the authors with any questions or concerns.

ADDITIONAL METHODS

These methods are not meant to be used in a standalone fashion, but if they were, it would look like this.

  • get_sequence()

      $project->get_sequence( $id );

    Notice that this method depends on a loaded (or started) project. It retrieves the sequence by id and saves it to a special data file which uses "0" as an index. All unavailable sequences retrieved this way are saved to this file, so it could potentially be larger than the rest.

      use NCBIx::BigFetch;
      
      my $id = 'AC123456';  # Get this however you want
    
      # Parameters
      my $params = { project_id => "1", 
                     base_dir   => "/home/user/data" };
      
      # Start project
      my $project = NCBIx::BigFetch->new( $params );
    
      # Get sequence
      my $sequence = $project->get_sequence( $id );
    
      exit;

    This method always adds the sequence to a special file with batch index of 0.

  • next_index()

      $project->next_index();

    Gets the next result index by adding the return_max value to the current index. The index is relative to the search results, and is the index of the first sequence in the returned batch (which serves as the batch id).

  • data_filename()

      $project->data_filename();

    Creates a filename for a given batch based on project_id and result index.

  • esearch_filename()

      $project->esearch_filename();

    Creates a filename for saving the initial search request.

  • config_filename()

      $project->config_filename();

    Creates a filename for the configuration file based on the project_id.
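The index arithmetic performed by next_index() can be sketched directly (a minimal example using the documented defaults of index 1 and return_max 500):

```perl
use strict;
use warnings;

# Each batch id is the index of its first sequence; the next id is
# simply current index + return_max, which is what next_index() computes.
my $index      = 1;
my $return_max = 500;

my @batch_ids;
for ( 1 .. 3 ) {
    push @batch_ids, $index;
    $index += $return_max;
}

print "@batch_ids\n";    # 1 501 1001
```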

PUBLIC PROPERTIES

All of the properties have get_/set_ methods courtesy of Class::Std and its :ATTR feature.

These properties have defaults, but each may be overridden by passing it as a key in the hashref to new(). (See the variable $params in the SYNOPSIS above.)

  • project_id

    The project_id is used to distinguish sets of data within a single data directory. It is part of each filename associated with the project. The default is "1". It is recommended that you always set project_id.

  • base_dir

    The base directory where project data will be saved. The default is /home/username. It is recommended that you always set base_dir.

  • base_url

    The base URL for NCBI eUtils. The default is "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/".

  • db

    Gets the eSearch database setting. The default is "protein".

  • index

    Gets the current result index. The index is advanced by the return_max amount after every attempted batch. The default is "1".

  • missing

    Gets the list of missing batch indices. This property is stored as an arrayref. The default is "[]".

  • query

    Gets the query string used for eSearch. The default is "apoptosis".

  • return_max

    Gets the retmax setting used to limit the batch size. The default is "500".

  • return_type

    Gets the rettype setting used to determine the format of fetched sequences. The default is "fasta".

  • return_mode

    Gets the retmode setting used to determine the format of fetched sequences. The default is "text".
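Taken together, these settings describe an eFetch request. The following sketch assembles a plausible URL from the defaults; the efetch.fcgi endpoint and parameter names (db, WebEnv, query_key, retstart, retmax, rettype, retmode) are standard eUtils, but the WebEnv value is a placeholder and the exact URL the module builds may differ:

```perl
use strict;
use warnings;

# Sketch: an eFetch URL assembled from the documented defaults.
# NCID_EXAMPLE is a placeholder; a real WebEnv comes from eSearch.
my %p = (
    base_url => "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/",
    db       => "protein",
    webenv   => "NCID_EXAMPLE",
    querykey => 1,
    index    => 1,
    retmax   => 500,
    rettype  => "fasta",
    retmode  => "text",
);

my $url = $p{base_url} . "efetch.fcgi"
        . "?db=$p{db}&WebEnv=$p{webenv}&query_key=$p{querykey}"
        . "&retstart=$p{index}&retmax=$p{retmax}"
        . "&rettype=$p{rettype}&retmode=$p{retmode}";

print "$url\n";
```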

INTERNAL PROPERTIES

These properties are set by the code.

  • querykey

    The querykey property is parsed from the eSearch result. It is currently expected to always be 1 (since only one query is ever submitted per WebEnv by NCBIx::BigFetch).

  • count

    The count property is parsed from the eSearch result and represents the total number of results for the query.

  • webenv

    The WebEnv property is parsed from the eSearch result. It is used to build the eFetch URL for retrieving batches of sequences. It represents a pointer to the results, which are stored on NCBI's servers for a few days before being deleted.

  • start_date

    Calculates the start date for the project.

  • start_time

    Calculates the start time for the project.

EXPORT

None

SEE ALSO

  • http://bioinformatics.ualr.edu/

  • http://www.ncbi.nlm.nih.gov/entrez/query/static/efetch_help.html

  • http://eutils.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html

  • http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_example.pl

AUTHORS

Feel free to email the authors with questions or concerns. Please be patient for a reply.

  • Roger Hall (roger@iosea.com), (rahall2@ualr.edu)

  • Michael Bauer (mbkodos@gmail.com), (mabauer@ualr.edu)

  • Kamakshi Duvvuru (kduvvuru@gmail.com)

COPYRIGHT AND LICENSE

Copyleft (C) 2009 by the Authors

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.5 or, at your option, any later version of Perl 5 you may have available.