NAME

extract_genes.pl - extract genomic sequences from NCBI files using BioPerl

DESCRIPTION

This script is a simple solution to the problem of extracting genomic regions corresponding to genes. There are other solutions, this particular approach uses genomic sequence files from NCBI and gene coordinates from Entrez Gene.

The first time this script is run it will be slow as it will extract species-specific data from the gene2accession file and create a storable hash (retrieving the positional data from this hash is significantly faster than reading gene2accession each time the script runs). The subsequent runs should be fast.

INSTALLATION

Install BioPerl, full instructions at http://bioperl.org.

Download gene2accession.gz

Download this file from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA into your working directory and gunzip it.

Download sequence files

Create one or more species directories in the working directory, the directory names do not have to match those at NCBI (e.g. "Sc", "Hs").

Download the nucleotide fasta files for a given species from its CHR* directories at ftp://ftp.ncbi.nlm.nih.gov/genomes and put these files into a species directory. The sequence files will have the suffix ".fna" or "fa.gz", gunzip if necessary.

Determine Taxon id

Determine the taxon id for the given species. This id is the first column in the gene2accession file. Modify the %species hash in this script such that name of your species directory is a key and the taxon id is the value.

Command-line options

  -i   Gene id
  -s   Name of species directory
  -h   Help

Example:

  extract_genes.pl -i 850302 -s Sc

To install Bio::Seq, copy and paste the appropriate command in to your terminal.

cpanm

cpanm Bio::Seq

CPAN shell

perl -MCPAN -e shell
install Bio::Seq

For more information on module installation, please visit the detailed CPAN module installation guide.

	Global
`s`	Focus search bar
`?`	Bring up this help dialog

	GitHub
`g` `p`	Go to pull requests
`g` `i`	go to github issues (only if github is preferred repository)

	POD
`g` `a`	Go to author
`g` `c`	Go to changes
`g` `i`	Go to issues
`g` `d`	Go to dist
`g` `r`	Go to repository/SCM
`g` `s`	Go to source
`g` `b`	Go to file browse

	Search terms
module: (e.g. module:Plugin)
distribution: (e.g. distribution:Dancer auth)
author: (e.g. author:SONGMU Redis)
version: (e.g. version:1.00)