extract_genes.pl - extract genomic sequences from NCBI files using BioPerl
This script is a simple solution to the problem of extracting genomic regions corresponding to genes. There are other solutions; this particular approach uses genomic sequence files from NCBI and gene coordinates from Entrez Gene.
The first run will be slow because the script extracts species-specific data from the gene2accession file and creates a storable hash (retrieving the positional data from this hash is significantly faster than reading gene2accession each time the script runs). Subsequent runs should be fast.
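The caching behaviour described above can be sketched as follows. This is a minimal illustration, not the script's actual code. It assumes Perl's core Storable module, a hypothetical cache file name, and a tab-delimited gene2accession layout in which the taxon id, gene id, genomic accession, start, end and orientation sit in columns 1, 2, 8, 10, 11 and 12; check these positions against the header line of your copy.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Return a hash reference of gene id => list of positional records for one
# taxon, reading it from a Storable cache file when that file already exists.
#   $g2a   : path to the uncompressed gene2accession file
#   $taxon : NCBI taxon id
#   $cache : cache file path (hypothetical name, e.g. "Sc.storable")
sub load_positions {
    my ($g2a, $taxon, $cache) = @_;

    # Fast path: a previous run already saved the hash for this species.
    return retrieve($cache) if -e $cache;

    # Slow path: scan the whole file once and keep only this taxon's rows.
    my %genes;
    open my $in, '<', $g2a or die "Cannot open $g2a: $!";
    while (my $line = <$in>) {
        next if $line =~ /^#/;              # skip the header line
        chomp $line;
        my @f = split /\t/, $line;
        next unless @f >= 12;               # assumed minimum column count
        next unless $f[0] eq $taxon;
        next if $f[7] eq '-' or $f[9] eq '-';   # row has no genomic placement
        push @{ $genes{ $f[1] } }, {
            accession => $f[7],
            start     => $f[9],
            end       => $f[10],
            strand    => $f[11],
        };
    }
    close $in;

    store(\%genes, $cache);                 # save for subsequent runs
    return \%genes;
}
```

A call such as load_positions('gene2accession', $taxon, 'Sc.storable') would scan the file the first time and read only the cache file afterwards.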
Install BioPerl; full instructions are at http://bioperl.org.
Download the gene2accession.gz file from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA into your working directory and gunzip it.
Create one or more species directories in the working directory. The directory names do not have to match those at NCBI (e.g. "Sc", "Hs").
Download the nucleotide fasta files for a given species from its CHR* directories at ftp://ftp.ncbi.nlm.nih.gov/genomes and put these files into a species directory. The sequence files will have the suffix ".fna" or ".fa.gz"; gunzip them if necessary.
Determine the taxon id for the given species. This id is the first column in the gene2accession file. Modify the %species hash in this script so that the name of your species directory is a key and the taxon id is the value.
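As a rough sketch of the last step, the %species hash could be filled in as shown below. The helper taxon_for_gene is hypothetical and not part of the script. It assumes the gene id is the second tab-delimited column of gene2accession, and it returns the first-column taxon id of the first matching row. 9606 is the NCBI taxon id for Homo sapiens.

```perl
use strict;
use warnings;

# Species directory name => NCBI taxon id.
# Add one entry per species directory you created.
my %species = (
    Hs => 9606,    # Homo sapiens
);

# Hypothetical helper: given a filehandle on gene2accession and a known
# gene id, return the taxon id (first column) of the first row whose
# gene id (assumed to be the second column) matches, or undef if none does.
sub taxon_for_gene {
    my ($fh, $gene_id) = @_;
    while (my $line = <$fh>) {
        next if $line =~ /^#/;    # skip the header line
        chomp $line;
        my ($tax_id, $id) = split /\t/, $line;
        return $tax_id if defined $id && $id eq $gene_id;
    }
    return;
}
```

For example, calling taxon_for_gene on a handle opened on gene2accession with a gene id you know belongs to your species gives the value to store under that species' directory name, such as $species{Sc}.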
-i Gene id
-s Name of species directory
-h Help
Example:
extract_genes.pl -i 850302 -s Sc