NAME

extract_genes.pl - extract genomic sequences from NCBI files using BioPerl

DESCRIPTION

This script is a simple solution to the problem of extracting genomic regions corresponding to genes. There are other solutions; this particular approach uses genomic sequence files from NCBI and gene coordinates from Entrez Gene.
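The core idea, cutting a gene out of a chromosome sequence given its coordinates, can be sketched in plain Perl. This is only an illustration of the logic, not the script's BioPerl code: extract_region is a hypothetical helper, and it assumes 0-based inclusive coordinates and a '+' or '-' orientation.

```perl
use strict;
use warnings;

# Hypothetical helper: return the gene sequence from a chromosome string.
# Assumes $start and $end are 0-based, inclusive coordinates and that
# $orientation is '+' or '-'.
sub extract_region {
    my ($chrom_seq, $start, $end, $orientation) = @_;
    die "Bad coordinates\n"
        if $start < 0 || $end < $start || $end >= length $chrom_seq;

    my $region = substr $chrom_seq, $start, $end - $start + 1;
    if ($orientation eq '-') {
        # Minus-strand genes: reverse the region, then complement it.
        $region = reverse $region;
        $region =~ tr/ACGTacgt/TGCAtgca/;
    }
    return $region;
}

print extract_region('AACCGGTT', 2, 5, '+'), "\n";   # prints CCGG
print extract_region('AAACCCGT', 1, 4, '-'), "\n";   # prints GGTT
```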

The first run of this script is slow because it extracts species-specific data from the gene2accession file and creates a storable hash. Retrieving the positional data from this hash is significantly faster than reading gene2accession each time the script runs, so subsequent runs should be fast.
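The caching step described above might look roughly like the following sketch, which uses the core Storable module. The helper name load_positions, the cache file argument, and the column positions (tax id in column 1, gene id in column 2, genomic accession in column 8, start, end and orientation in columns 10 to 12) are assumptions based on the gene2accession layout, not taken from the script itself.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);

# Hypothetical helper: return a hash reference of
# gene id => { acc, start, end, strand } for a single taxon id.
sub load_positions {
    my ($taxon_id, $g2a_file, $cache_file) = @_;

    # Fast path: reuse the hash stored by an earlier run.
    return retrieve($cache_file) if -e $cache_file;

    # Slow path: scan gene2accession once, keeping rows for this taxon.
    my %positions;
    open my $fh, '<', $g2a_file or die "Cannot open $g2a_file: $!\n";
    while (my $line = <$fh>) {
        next if $line =~ /^#/;                    # skip the header line
        chomp $line;
        my @f = split /\t/, $line;
        next unless @f >= 12;                     # assumed column layout
        next unless $f[0] eq $taxon_id;
        next if $f[9] eq '-' || $f[10] eq '-';    # no genomic position
        # If a gene has several rows, the last one read wins.
        $positions{ $f[1] } = {
            acc    => $f[7],
            start  => $f[9],
            end    => $f[10],
            strand => $f[11],
        };
    }
    close $fh;

    nstore(\%positions, $cache_file);
    return \%positions;
}
```

A first call such as load_positions($taxon_id, 'gene2accession', 'Sc.storable') would read the whole file and write the cache; later calls with the same cache file return the stored hash without touching gene2accession.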

INSTALLATION

Install BioPerl; full instructions are at http://bioperl.org.

Download gene2accession.gz

Download this file from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA into your working directory and gunzip it.

Download sequence files

Create one or more species directories in the working directory. The directory names do not have to match those at NCBI (e.g. "Sc", "Hs").

Download the nucleotide FASTA files for a given species from its CHR* directories at ftp://ftp.ncbi.nlm.nih.gov/genomes and put these files into a species directory. The sequence files will have the suffix ".fna" or ".fa.gz"; gunzip them if necessary.

Determine taxon id

Determine the taxon id for the given species. This id is the first column in the gene2accession file. Modify the %species hash in this script so that the name of your species directory is a key and the taxon id is the value.
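As a rough illustration, the %species hash could be filled in as shown below. The directory names come from the example above. The taxon ids are illustrative only: gene2accession may list a strain-level id rather than the species-level id, so confirm each value against the first column of the file. The taxon_id_for helper is hypothetical and is not part of the script.

```perl
use strict;
use warnings;

# Keys are species directory names, values are NCBI taxon ids.
# Illustrative values: check them against column 1 of gene2accession.
my %species = (
    Hs => 9606,    # Homo sapiens
    Sc => 4932,    # Saccharomyces cerevisiae (species-level id)
);

# Hypothetical helper: map a species directory name to its taxon id.
sub taxon_id_for {
    my ($dir) = @_;
    die "Unknown species directory '$dir'\n" unless exists $species{$dir};
    return $species{$dir};
}

print "Hs => ", taxon_id_for('Hs'), "\n";
```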

Command-line options

  -i   Gene id
  -s   Name of species directory
  -h   Help

Example:

  extract_genes.pl -i 850302 -s Sc
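
One way to handle these options is the core Getopt::Std module, sketched below. How the script actually parses its arguments is not shown here, so this is an assumption; parse_options is a hypothetical helper, and the usage message is made up for illustration.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical helper: parse -i, -s and -h from a list of arguments.
sub parse_options {
    my @args = @_;
    local @ARGV = @args;    # getopts reads from @ARGV
    my %opt;
    getopts('i:s:h', \%opt)
        or die "Usage: extract_genes.pl -i <gene id> -s <species dir>\n";
    return \%opt;
}

# Same arguments as the example invocation above.
my $opt = parse_options(qw(-i 850302 -s Sc));
print "gene id: $opt->{i}, species directory: $opt->{s}\n";
```

In a real script the helper would be called with @ARGV, and -h (or a missing -i or -s) would print the usage text and exit.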