import_ncbi_mv_hs.pl -- make gff files from NCBI Map Viewer data files.
perl import_ncbi_mv_hs.pl --type type [options]
Download from ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/maps/mapview/BUILD.34.3/ (or the most current directory) the files
seq_gene.md.gz and gene.q.gz
to the same directory as import_ncbi_mv_hs.pl and execute the command
perl import_ncbi_mv_hs.pl --type gene
This creates the file seq_gene.gff which can be loaded into a gbrowse database using bp_load_gff.pl.
This script reads two kinds of input files from the NCBI Map Viewer FTP site. The source for human input files is
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/maps/mapview
which contains subdirectories for the various builds. For example, mapview/BUILD.34.3/seq_gene.md.gz would be an input file for use with the subroutine mk_seq_gene.
At the moment this script imports the files seq_gene.md (essentially records from the Entrez Gene database) and seq_sts.md (the UniSTS database). However, many other kinds of data are available from the Map Viewer FTP site.
This script does not load the gff files into the database. This can be achieved by running the script bp_load_gff.pl with the output files (gff files) from import_ncbi_mv_hs.pl.
The argument 'type' to the option '--type' indicates what kind of Map Viewer file to import.
type  Map Viewer file
----  ---------------
gene  seq_gene.md. The path of this file can be indicated with the --seq_gene option. The script can read directly from the compressed version seq_gene.md.gz.
sts   seq_sts.md. Similarly, use the --seq_sts option to specify the path.
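The text above says the script can read either the plain-text file or its gzip-compressed version. The following is a minimal sketch of one way to do that, not the script's actual code. It assumes the core module IO::Uncompress::Gunzip, and the helper name open_mapview is hypothetical.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper (not part of import_ncbi_mv_hs.pl):
# return a read handle for a plain-text or *.gz Map Viewer file.
sub open_mapview {
    my ($path) = @_;
    if ($path =~ /\.gz$/) {
        # Decompress on the fly; the returned object reads like a file handle.
        my $fh = IO::Uncompress::Gunzip->new($path)
            or die "Cannot gunzip $path: $GunzipError\n";
        return $fh;
    }
    open(my $fh, '<', $path) or die "Cannot open $path: $!\n";
    return $fh;
}
```

In both cases the caller can read records with while (my $line = <$fh>), so the rest of the import code does not need to know whether the input was compressed.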
Options (default)
--type        Type of file: gene, sts. Explained above.
--seq_gene    Path for file seq_gene.md, text or *.gz (seq_gene.md.gz)
--gene_q      Path for file gene.q, text or *.gz (gene.q.gz). See hs_mk_seq_gene.
--seq_sts     Path for file seq_sts.md, text or *.gz (seq_sts.md.gz)
--chromosome  Only import records for this chromosome
--gff         Path of gff file to create (default=seq_gene.gff for type=gene, etc)
--min_pos     Minimum chromosomal position to import
--max_pos     Maximum chromosomal position to import
Example:
perl import_ncbi_mv_hs.pl --type gene --chr 2 --gff seq_gene_chr2.gff
This imports the chromosome 2 records from the file seq_gene.md.gz and writes them to seq_gene_chr2.gff.
Scott Saccone (ssaccone@han.wustl.edu)
hs_mk_seq_gene(-seq_gene=>'seq_gene.md.gz', -gene_q=>'gene.q', -gff=>'seq_gene_chr1.gff', -assembly=>'reference', -chromosome=>1, -min_pos=>undef, -max_pos=>undef );
This converts the human Map Viewer file seq_gene.md to gff format. The gff source field is named "ncbi:mapview:$assembly" where $assembly is specified as an option whose default is 'reference'. Optionally, gene descriptions can be obtained from the Map Viewer file 'gene.q' in which case the group field of the gff gets a 'Note' attribute; for example 'Note "similar to beta-tubulin 4Q"'.
Format of seq_gene.md: tab delimited, header at line 1. Fields:
0   taxid
1   chr
2   chrStart
3   chrEnd
4   orientation
5   contig
6   cnt_start
7   cnt_end
8   cnt_orient
9   featureName
10  featureId
11  featureType
12  groupLabel
13  transcript
14  weight
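As an illustration of the layout above, here is a minimal sketch that splits one data line of seq_gene.md into a hash keyed by column name. The sub name parse_seq_gene_line and the array @SEQ_GENE_FIELDS are hypothetical and are not taken from the script.

```perl
use strict;
use warnings;

# Column names in the order listed above (indices 0..14).
my @SEQ_GENE_FIELDS = qw(
    taxid chr chrStart chrEnd orientation contig cnt_start cnt_end
    cnt_orient featureName featureId featureType groupLabel transcript weight
);

# Hypothetical helper: split one tab-delimited data line into a hash
# reference keyed by column name. Missing trailing columns become undef.
sub parse_seq_gene_line {
    my ($line) = @_;
    chomp $line;
    my @values = split /\t/, $line, -1;   # -1 keeps empty trailing fields
    my %rec;
    @rec{@SEQ_GENE_FIELDS} = @values;
    return \%rec;
}
```

The header at line 1 should be skipped before calling this on the remaining lines.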
Notes on the fields:
featureId: has the form GeneID:n where n is the Entrez Gene ID. This is sometimes the same as the LocusLink ID, but I believe LocusLink is being phased out and these IDs may not always agree. Features that are grouped together by a common featureId will have a common group id in the gff file, so that the transcript aggregator can then be applied.
featureType: is used to define the method field in the gff record. The values I've seen are GENE, UTR, CDS and PSEUDO. I think the current transcript aggregator only recognizes CDS (the GENE records use the 'transcript' method). Perhaps UTR must be converted to 5'UTR and 3'UTR somehow.
groupLabel: the 'assembly', I believe: 'reference', 'HSC_TCAG' or 'DR51'.
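The field notes above can be sketched as a function that turns one parsed record into a nine-column GFF line. This is only an illustration under stated assumptions, not the script's implementation. The sub name seq_gene_to_gff, the group class 'Gene', the '.' placeholders for score and frame, and the use of the chr value unchanged as the reference sequence name are all assumptions. The source name, the 'transcript' method for GENE records, the grouping by featureId and the Note attribute follow the description in this document.

```perl
use strict;
use warnings;

# Hypothetical helper: build one GFF line from a parsed seq_gene.md
# record (a hash reference keyed by the column names listed above).
# Assumed, not taken from the script: group class 'Gene', '.' for
# score and frame, and chr used as-is for the reference sequence.
sub seq_gene_to_gff {
    my ($rec, $assembly, $note) = @_;
    $assembly = 'reference' unless defined $assembly;

    # GENE records use the 'transcript' method; other types pass through.
    my $method = $rec->{featureType} eq 'GENE'
        ? 'transcript'
        : $rec->{featureType};

    # Features sharing a featureId share a group id.
    my $group = qq(Gene "$rec->{featureId}");
    $group .= qq(; Note "$note") if defined $note && length $note;

    return join("\t",
        $rec->{chr}, "ncbi:mapview:$assembly", $method,
        $rec->{chrStart}, $rec->{chrEnd}, '.', $rec->{orientation}, '.',
        $group) . "\n";
}
```

Passing undef for the assembly falls back to 'reference', which mirrors the documented default.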
Options (default):
-seq_gene    mapview file with gene locations, text or *.gz file (seq_gene.md.gz)
-gene_q      mapview file with gene descriptions, text or *.gz file (gene.q.gz)
-chromosome  only make records for this chromosome
-min_pos     minimum chromosomal position
-max_pos     maximum chromosomal position
-assembly    which assembly to use (reference)
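To make the filtering options concrete, here is a minimal sketch of a predicate that decides whether a record is kept. The sub name keep_record is hypothetical. The sketch assumes that a feature must lie entirely inside the range from -min_pos to -max_pos; the script may instead keep any feature that overlaps the range. An undefined option means no restriction.

```perl
use strict;
use warnings;

# Hypothetical helper: return 1 if the record passes the optional
# filters, 0 otherwise. $rec is a hash reference keyed by the Map
# Viewer column names (chr, chrStart, chrEnd, groupLabel).
# Assumption: containment in [min_pos, max_pos], not overlap.
sub keep_record {
    my ($rec, %opt) = @_;
    return 0 if defined $opt{-chromosome} && $rec->{chr} ne $opt{-chromosome};
    return 0 if defined $opt{-assembly}   && $rec->{groupLabel} ne $opt{-assembly};
    return 0 if defined $opt{-min_pos}    && $rec->{chrStart} < $opt{-min_pos};
    return 0 if defined $opt{-max_pos}    && $rec->{chrEnd}   > $opt{-max_pos};
    return 1;
}
```

A call such as keep_record($rec, -chromosome => 1, -assembly => 'reference') uses the same dash-prefixed option names as hs_mk_seq_gene.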
Read the Map Viewer file gene.q and store the full gene descriptions. Used by hs_mk_seq_gene.
Format of gene.q: tab delimited, header at line 1.
field 0: GeneID
field 7: full description
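A minimal sketch of reading the descriptions into a hash, following the format above. The sub name read_gene_descriptions is hypothetical. Whether field 0 holds a bare number or a prefixed ID is not stated here, so the sketch stores the value exactly as it appears in the file.

```perl
use strict;
use warnings;

# Hypothetical helper: read gene descriptions from an open handle.
# Returns a hash reference mapping field 0 (GeneID) to field 7 (full
# description). Records with an empty description are skipped.
sub read_gene_descriptions {
    my ($fh) = @_;
    my %desc;
    my $header = <$fh>;                     # discard the header line
    while (my $line = <$fh>) {
        chomp $line;
        my @f = split /\t/, $line, -1;      # -1 keeps empty trailing fields
        next unless defined $f[7] && length $f[7];
        $desc{ $f[0] } = $f[7];
    }
    return \%desc;
}
```

The resulting hash can then be consulted per gene record to fill the 'Note' attribute when a description exists.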
hs_mk_seq_sts(-seq_sts=>'seq_sts.md.gz', -gff=>'seq_sts_chr1.gff', -assembly=>'reference', -chromosome=>1, -min_pos=>undef, -max_pos=>undef );
Convert human Map Viewer file seq_sts.md to gff format. The gff source is 'sts' and the gff method is "ncbi:mapview:$assembly" where $assembly is specified as an option whose default is 'reference'. The group field is of the form 'STS "name"; Name "name"' where name is the featureName field from the Map Viewer file. The group fields will also contain 'UniSTS_ID n' if the UniSTS ID is available in the Map Viewer record.
Format of seq_sts.md: tab delimited, header at line 1. Fields:
0   taxid
1   chr
2   chrStart
3   chrEnd
4   orientation
5   contig
6   cnt_start
7   cnt_end
8   cnt_orient
9   featureName
10  featureId
11  featureType
12  groupLabel
13  weight
Notes on the fields:
featureId: has the form UniSTS:n where n is the UniSTS ID.
groupLabel: see hs_mk_seq_gene.
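The group field described for hs_mk_seq_sts can be sketched as follows. The sub name sts_group_field is hypothetical, and the '; ' separator placed before UniSTS_ID is an assumption modelled on the separator shown between the STS and Name attributes.

```perl
use strict;
use warnings;

# Hypothetical helper: build the GFF group field for an STS record.
# $feature_name is the featureName column; $feature_id is the featureId
# column, expected to look like 'UniSTS:n' when an ID is available.
sub sts_group_field {
    my ($feature_name, $feature_id) = @_;
    my $group = qq(STS "$feature_name"; Name "$feature_name");
    if (defined $feature_id && $feature_id =~ /^UniSTS:(\d+)$/) {
        $group .= "; UniSTS_ID $1";     # separator is an assumption
    }
    return $group;
}
```

When the featureId does not match the UniSTS:n pattern, the group field contains only the STS and Name attributes.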
Options (default):
-seq_sts     Map Viewer file with sts locations. Can read directly from *.gz file (seq_sts.md.gz)
-chromosome  only make records for this chromosome
-min_pos     minimum chromosomal position
-max_pos     maximum chromosomal position
-assembly    assembly to use (reference)