NAME

process_ncbi_human.pl - Massage NCBI chromosome annotation into GFF format suitable for Bio::DB::GFF

VERSION (CVS-info)

 $RCSfile: process_ncbi_human.pl,v $
 $Revision: 1.1 $
 $Author: lstein $
 $Date: 2008-10-16 17:01:27 $

SYNOPSIS

   perl process_ncbi_human.pl [options] /path/to/gzipped/datafile(s)

DESCRIPTION

This script massages the chromosome annotation files located at

  ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/maps/mapview/chromosome_order/

into the GFF format recognized by Bio::DB::GFF. If the resulting GFF files are loaded into a Bio::DB::GFF database using the utilities described below, the annotation can be viewed in the Generic Genome Browser (http://www.gmod.org/ggb/) and queried through the Bio::DB::GFF libraries. (NB: according to NCBI's READMEs, these data files are dumps from the database backend of NCBI's own map viewer.)
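As a rough illustration of the target format, the sketch below assembles one GFF2-style line of nine tab-separated columns, with the group column written as a class followed by a quoted name. The helper name gff_line and every value in the demo call are made up for illustration; this is not the script's actual code.

```perl
use strict;
use warnings;

# Build one GFF2-style line: nine tab-separated columns.
# Undefined score, strand or frame are written as '.'.
sub gff_line {
    my (%f) = @_;
    my @cols = (
        $f{ref}, $f{source}, $f{method},
        $f{start}, $f{end},
        (map { defined $f{$_} ? $f{$_} : '.' } qw(score strand frame)),
        qq($f{class} "$f{name}"),   # group column: class, then quoted name
    );
    return join("\t", @cols);
}

# All values below are hypothetical.
my $line = gff_line(
    ref   => 'Chr1',  source => 'NCBI', method => 'locus',
    start => 1000,    end    => 2000,   strand => '+',
    class => 'Locus', name   => 'EXAMPLE1',
);
print "$line\n";
```

Keeping the column order in one helper makes it harder for different feature handlers to drift apart in how they write their output.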

To produce the GFF files, download all the chr*sequence.gz files from the FTP directory above. From the directory that holds the downloaded files, run the following example command (running the script with no arguments prints a help message):

  process_ncbi_human.pl --locuslink [path to LL.out_hs.gz] chr*sequence.gz

This unzips all the files on the fly and opens an output file named chrom[$chrom]_ncbiannotation.gff for each one. The script reads the LocusLink records into an in-memory hash, then reads through the NCBI feature lines, looks up the details of each 'locus' feature in the LocusLink hash, and prints the result to the proper GFF file. At the time of writing, LL.out_hs.gz is available here:

  ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL.out_hs.gz
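The processing steps described above (decompress on the fly, build an in-memory lookup hash, name one output file per chromosome) can be sketched roughly as follows. This is a minimal illustration, not the script's own code: the helper names gff_filename and read_lookup, the two-column "id, description" layout, and the demo records are all assumptions, and the real LocusLink file has a different layout.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Output file name for one chromosome, following the pattern above.
sub gff_filename {
    my ($chrom) = @_;
    return "chrom${chrom}_ncbiannotation.gff";
}

# Read "id<TAB>description" lines from a gzipped source into a hash.
# $source may be a file name or a reference to a scalar holding gzip data.
# The two-column layout is a simplification, not the real LocusLink format.
sub read_lookup {
    my ($source) = @_;
    my $z = IO::Uncompress::Gunzip->new($source)
        or die "gunzip failed: $GunzipError\n";
    my %lookup;
    while (defined(my $line = $z->getline())) {
        chomp $line;
        my ($id, $desc) = split /\t/, $line, 2;
        next unless defined $id && length $id;
        $lookup{$id} = defined $desc ? $desc : '';
    }
    $z->close();
    return \%lookup;
}

# Demo: compress made-up records in memory, then read them back.
my $plain = "ID_A\tfirst example locus\nID_B\tsecond example locus\n";
my $gzipped;
gzip(\$plain => \$gzipped) or die "gzip failed: $GzipError\n";

my $lookup = read_lookup(\$gzipped);
print gff_filename('X'), "\n";
print $lookup->{ID_B}, "\n";
```

Reading through IO::Uncompress::Gunzip avoids writing uncompressed copies to disk; piping from an external gunzip process would be another way to get the same effect.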

Note that several of the NCBI feature types are skipped during reformatting, either because their nature is not fully known at this time (TAG, GS_TRAN) or because their sheer volume keeps them from being accessible in Bio::DB::GFF at this time (EST similarities). You can change this by editing the $SKIP variable to add or remove feature types, but any feature type you newly include will need its own handling code.
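The form of $SKIP inside the script is not shown in this document, so the following is only a sketch of how such a skip set could work. It assumes a hash reference keyed by feature type; TAG and GS_TRAN are taken from the text above, while keep_features and the demo records are hypothetical.

```perl
use strict;
use warnings;

# Assumed form: a hash reference whose keys are feature types to skip.
# The real $SKIP variable in the script may be structured differently.
my $SKIP = { TAG => 1, GS_TRAN => 1 };

# Return only the records whose type is not in the skip set.
sub keep_features {
    my ($skip, @records) = @_;
    return grep { !$skip->{ $_->{type} } } @records;
}

# Made-up records for illustration.
my @records = (
    { type => 'TAG',     name => 'example_tag'   },
    { type => 'locus',   name => 'example_locus' },
    { type => 'GS_TRAN', name => 'example_tran'  },
);

my @kept = keep_features($SKIP, @records);
print join(',', map { $_->{name} } @kept), "\n";
```

Under this assumption, deleting a key from the skip set lets that feature type through, at which point the code that writes GFF lines has to know how to handle it.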

To bulk-import the GFF files into a Bio::DB::GFF database, use the bulk_load_gff.pl utility provided with Bio::DB::GFF.
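A loader invocation might be assembled along these lines. The database name ncbi_human, the chromosome list, and the helper bulk_load_command are hypothetical, and the --database option should be checked against the help output of the bulk_load_gff.pl version you have installed.

```perl
use strict;
use warnings;

# Assemble the argument list for bulk_load_gff.pl.
# --database names the target database; 'ncbi_human' below is hypothetical.
sub bulk_load_command {
    my ($database, @gff_files) = @_;
    return ('bulk_load_gff.pl', '--database', $database, @gff_files);
}

# Hypothetical subset of chromosomes, named with the output pattern above.
my @gff_files = map { "chrom${_}_ncbiannotation.gff" } (1 .. 3, 'X');
my @cmd = bulk_load_command('ncbi_human', @gff_files);
print join(' ', @cmd), "\n";

# To actually run the loader (requires a configured database):
# system(@cmd) == 0 or die "bulk_load_gff.pl failed: $?\n";
```

Passing the command to system as a list rather than a single string avoids shell interpretation of the file names.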

AUTHOR

Gudmundur Arni Thorisson <mummi@cshl.org>

Copyright (c) 2002 Cold Spring Harbor Laboratory

       This code is free software; you can redistribute it
       and/or modify it under the same terms as Perl itself.