
NAME

import_ncbi_mv_hs.pl -- make gff files from NCBI Map Viewer data files.

SYNOPSIS

perl import_ncbi_mv_hs.pl --type type [options]

A QUICK RUN

Download from ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/maps/mapview/BUILD.34.3/ (or the most current directory) the files

 seq_gene.md.gz
 gene.q.gz

to the same directory as import_ncbi_mv_hs.pl and execute the command

 perl import_ncbi_mv_hs.pl --type gene

This creates the file seq_gene.gff which can be loaded into a gbrowse database using bp_load_gff.pl.

DESCRIPTION

This script reads two kinds of input files from the NCBI Map Viewer FTP site. The source for human input files is

 ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/maps/mapview

which contains subdirectories for the various builds. For example, mapview/BUILD.34.3/seq_gene.md.gz would be an input file for use with the subroutine hs_mk_seq_gene.

At the moment this script imports the files seq_gene.md (essentially records from the Entrez Gene database) and seq_sts.md (the UniSTS database). However, many other kinds of data are available from the Map Viewer FTP site.

This script does not load the gff files into the database. This can be achieved by running the script bp_load_gff.pl with the output files (gff files) from import_ncbi_mv_hs.pl.

The argument 'type' to the option '--type' indicates what kind of Map Viewer file to import.

 type   Map Viewer file
 ----   ---------------
 gene   seq_gene.md. The path of this file can be indicated
        with the --seq_gene option. The script can read
        directly from the compressed version seq_gene.md.gz.

 sts    seq_sts.md. Similarly, use the --seq_sts option to specify
        the path.
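As an illustration of reading either a plain-text or a gzip-compressed input, here is a minimal sketch using the core module IO::Uncompress::Gunzip, whose default transparent mode also passes uncompressed data through. This is an assumption about one way to do it, not necessarily how import_ncbi_mv_hs.pl opens its files; the helper name read_all_lines is hypothetical.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper: return all lines from a file name or a reference
# to an in-memory buffer, whether or not the data is gzip-compressed.
sub read_all_lines {
    my ($source) = @_;
    my $in = IO::Uncompress::Gunzip->new($source)
        or die "cannot open input: $GunzipError\n";
    my @lines;
    while (defined(my $line = $in->getline)) {
        push @lines, $line;
    }
    $in->close;
    return @lines;
}

# Demonstrate with in-memory data so that no files are needed.
my $text = "line one\nline two\n";
my $compressed;
gzip(\$text => \$compressed) or die "gzip failed: $GzipError\n";

my @from_gz    = read_all_lines(\$compressed);   # compressed input
my @from_plain = read_all_lines(\$text);         # plain-text input
print scalar(@from_gz), " lines from gz, ", scalar(@from_plain), " lines from text\n";
```

With a real download, the same helper would be called with a path such as 'seq_gene.md.gz' or 'seq_gene.md'.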

Options (default)

 --type        Type of file: gene, sts. Explained above.
 --seq_gene    Path for file seq_gene.md, text or *.gz (seq_gene.md.gz)
 --gene_q      Path for file gene.q, text or *.gz (gene.q.gz). See hs_mk_seq_gene.
 --seq_sts     Path for file seq_sts.md, text or *.gz (seq_sts.md.gz)
 --chromosome  Only import records for this chromosome
 --gff         Path of gff file to create (default=seq_gene.gff for type=gene, etc.)
 --min_pos     Minimum chromosomal position to import
 --max_pos     Maximum chromosomal position to import
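The effect of the --chromosome, --min_pos and --max_pos options can be sketched as a simple record filter. This is only an illustration: the function name keep_record is hypothetical, and it assumes that a record must lie entirely within the given range, which the documentation does not state. The sample records are made up.

```perl
use strict;
use warnings;

# Return true if a record passes the chromosome and position filters.
# An undefined option means "no restriction" for that filter.
# Assumption: a record is kept only if it lies entirely inside the range.
sub keep_record {
    my ($rec, %opt) = @_;
    return 0 if defined $opt{chromosome} && $rec->{chr} ne $opt{chromosome};
    return 0 if defined $opt{min_pos}    && $rec->{chrStart} < $opt{min_pos};
    return 0 if defined $opt{max_pos}    && $rec->{chrEnd}   > $opt{max_pos};
    return 1;
}

# Made-up records, for illustration only.
my @records = (
    { chr => '2', chrStart => 100,  chrEnd => 900  },
    { chr => '2', chrStart => 5000, chrEnd => 6000 },
    { chr => 'X', chrStart => 100,  chrEnd => 900  },
);

my @kept = grep { keep_record($_, chromosome => '2', max_pos => 1000) } @records;
printf "%d of %d records kept\n", scalar @kept, scalar @records;
```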

Example:

 perl import_ncbi_mv_hs.pl --type gene --chr 2 --gff seq_gene_chr2.gff

This imports the chromosome 2 records from the default input file seq_gene.md.gz and writes them to seq_gene_chr2.gff.

AUTHOR

Scott Saccone (ssaccone@han.wustl.edu)

hs_mk_seq_gene

Example:

 hs_mk_seq_gene(-seq_gene=>'seq_gene.md.gz',
                -gene_q=>'gene.q',
                -gff=>'seq_gene_chr1.gff',
                -assembly=>'reference',
                -chromosome=>1,
                -min_pos=>undef,
                -max_pos=>undef
               );

This converts the human Map Viewer file seq_gene.md to gff format. The gff source field is named "ncbi:mapview:$assembly" where $assembly is specified as an option whose default is 'reference'. Optionally, gene descriptions can be obtained from the Map Viewer file 'gene.q' in which case the group field of the gff gets a 'Note' attribute; for example 'Note "similar to beta-tubulin 4Q"'.

Format of seq_gene.md: tab delimited, with a header at line 1. Fields:

  0  taxid
  1  chr
  2  chrStart
  3  chrEnd
  4  orientation
  5  contig
  6  cnt_start
  7  cnt_end
  8  cnt_orient
  9  featureName
 10  featureId
 11  featureType
 12  groupLabel
 13  transcript
 14  weight
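The column layout above can be turned into a small parser that maps each tab-delimited line to a hash keyed by field name. This is a minimal sketch, not the script's own code; the function name parse_seq_gene_line is hypothetical and the sample line is made up rather than taken from NCBI data.

```perl
use strict;
use warnings;

# Column names in the order listed above (indexes 0 to 14).
my @SEQ_GENE_FIELDS = qw(
    taxid chr chrStart chrEnd orientation contig cnt_start cnt_end
    cnt_orient featureName featureId featureType groupLabel transcript weight
);

# Split one tab-delimited data line into a hash keyed by column name.
sub parse_seq_gene_line {
    my ($line) = @_;
    chomp $line;
    my @values = split /\t/, $line, -1;   # -1 keeps trailing empty fields
    my %record;
    @record{@SEQ_GENE_FIELDS} = @values;
    return \%record;
}

# Made-up sample line, for illustration only.
my $sample = join "\t", 9606, 1, 1000, 2000, '+', 'NT_000001', 1, 1001,
                        '+', 'GENE1', 'GeneID:12345', 'GENE', 'reference', '-', 1;
my $rec = parse_seq_gene_line("$sample\n");
print "$rec->{featureName} ($rec->{featureId}) at $rec->{chrStart}-$rec->{chrEnd}\n";
```

When reading a real file, the header at line 1 would be skipped before parsing the data lines.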

Notes on the fields:

 featureId: has the form GeneID:n, where n is the Entrez Gene ID. This
 is sometimes the same as the LocusLink ID, but I believe LocusLink
 is being phased out and these IDs may not always agree. Features that
 share a common featureId get a common group id in the gff file, so
 that the transcript aggregator can then be applied.

 featureType: is used to define the method field in the gff
 record. The values I've seen are GENE, UTR, CDS and PSEUDO. I think
 the current transcript aggregator only recognizes CDS (the
 GENE records use the 'transcript' method). Perhaps UTR must
 be converted to 5'UTR and 3'UTR somehow.

 groupLabel: the 'assembly', I believe: 'reference', 'HSC_TCAG' or 'DR51'.
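To make the mapping from a Map Viewer record to a gff line concrete, here is a rough sketch based on the description and notes above. Several details are assumptions rather than confirmed behaviour of hs_mk_seq_gene: the function name record_to_gff, the group class name "Gene", the '.' placeholders for score and frame, the separator before the Note attribute, and passing every featureType other than GENE through unchanged. The sample record is made up.

```perl
use strict;
use warnings;

# Build one tab-delimited gff line from a parsed seq_gene.md record.
# Returns nothing if the featureId does not look like GeneID:n.
sub record_to_gff {
    my ($rec, $assembly, $description) = @_;

    my ($gene_id) = $rec->{featureId} =~ /^GeneID:(\d+)$/;
    return unless defined $gene_id;

    # GENE records use the 'transcript' method; other types are passed
    # through unchanged (assumption).
    my $method = $rec->{featureType} eq 'GENE' ? 'transcript' : $rec->{featureType};
    my $strand = $rec->{orientation} =~ /^[+-]$/ ? $rec->{orientation} : '.';

    # Records sharing a featureId share a group; class name is an assumption.
    my $group = qq(Gene "$gene_id");
    $group .= qq( ; Note "$description") if defined $description;

    return join "\t",
        $rec->{chr}, "ncbi:mapview:$assembly", $method,
        $rec->{chrStart}, $rec->{chrEnd}, '.', $strand, '.', $group;
}

# Made-up record, for illustration only.
my %rec = (
    chr => 1, chrStart => 1000, chrEnd => 2000, orientation => '+',
    featureId => 'GeneID:12345', featureType => 'GENE',
);
my $gff_line = record_to_gff(\%rec, 'reference', 'similar to beta-tubulin 4Q');
print "$gff_line\n";
```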

Options (default):

 -seq_gene    mapview file with gene locations, text or *.gz file (seq_gene.md.gz)
 -gene_q      mapview file with gene descriptions, text or *.gz file (gene.q.gz)
 -chromosome  only make records for this chromosome
 -min_pos     minimum chromosomal position
 -max_pos     maximum chromosomal position
 -assembly    which assembly to use (reference)

read_seq_q

Read Map Viewer file seq_q and store the full gene descriptions. Used by hs_mk_seq_gene.

Format of seq_q: tab delimited, with a header at line 1.

 field 0: GeneID
 field 7: full description
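A lookup table of descriptions can be built from this layout roughly as follows. This is a sketch under assumptions: the function name read_descriptions is hypothetical, field 0 is assumed to hold the bare numeric gene ID, and the in-memory sample data is made up.

```perl
use strict;
use warnings;

# Read a tab-delimited description file from an open filehandle and
# return a hash reference mapping field 0 (gene ID) to field 7 (description).
sub read_descriptions {
    my ($fh) = @_;
    my %description_for;
    my $header = <$fh>;                   # line 1 is the header; skip it
    while (my $line = <$fh>) {
        chomp $line;
        my @f = split /\t/, $line, -1;    # -1 keeps trailing empty fields
        next unless defined $f[7] && length $f[7];
        $description_for{ $f[0] } = $f[7];
    }
    return \%description_for;
}

# Made-up data with eight columns, for illustration only.
my $data = join("\t", 'GeneID', ('x') x 6, 'description') . "\n"
         . join("\t", 12345, ('') x 6, 'similar to beta-tubulin 4Q') . "\n"
         . join("\t", 67890, ('') x 6, '') . "\n";

open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my $desc = read_descriptions($fh);
close $fh;
print "12345: $desc->{12345}\n";
```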

hs_mk_seq_sts

Example:

 hs_mk_seq_sts(-seq_sts=>'seq_sts.md.gz',
               -gff=>'seq_sts_chr1.gff',
               -assembly=>'reference',
               -chromosome=>1,
               -min_pos=>undef,
               -max_pos=>undef
              );

Convert human Map Viewer file seq_sts.md to gff format. The gff source is 'sts' and the gff method is "ncbi:mapview:$assembly" where $assembly is specified as an option whose default is 'reference'. The group field is of the form 'STS "name"; Name "name"' where name is the featureName field from the Map Viewer file. The group fields will also contain 'UniSTS_ID n' if the UniSTS ID is available in the Map Viewer record.

Format of seq_sts.md: tab delimited, with a header at line 1. Fields:

  0  taxid
  1  chr
  2  chrStart
  3  chrEnd
  4  orientation
  5  contig
  6  cnt_start
  7  cnt_end
  8  cnt_orient
  9  featureName
 10  featureId
 11  featureType
 12  groupLabel
 13  weight

Notes on the fields:

 featureId: has the form UniSTS:n where n is the UniSTS ID.

 groupLabel: see hs_mk_seq_gene.
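The group field described above can be assembled along these lines. This is a sketch only: the function name sts_group is hypothetical, the '; ' separator before UniSTS_ID is an assumption, and the sample values are made up.

```perl
use strict;
use warnings;

# Build the gff group field for an STS record: 'STS "name"; Name "name"',
# followed by 'UniSTS_ID n' when featureId has the form UniSTS:n.
sub sts_group {
    my ($rec) = @_;
    my $name  = $rec->{featureName};
    my $group = qq(STS "$name"; Name "$name");
    if (defined $rec->{featureId} && $rec->{featureId} =~ /^UniSTS:(\d+)$/) {
        $group .= "; UniSTS_ID $1";       # separator is an assumption
    }
    return $group;
}

# Made-up records, for illustration only.
my $with_id    = sts_group({ featureName => 'STS-A1', featureId => 'UniSTS:4321' });
my $without_id = sts_group({ featureName => 'STS-B2', featureId => '' });
print "$with_id\n$without_id\n";
```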

Options (default):

 -seq_sts     Map Viewer file with sts locations. Can read directly from *.gz file (seq_sts.md.gz)
 -chromosome  only make records for this chromosome
 -min_pos     minimum chromosomal position
 -max_pos     maximum chromosomal position
 -assembly    assembly to use (reference)