TopoView Glyph
Warning: This software is still under development and is distributed "as is", without packaging, in exactly the condition in which FlyBase currently uses it. You are free to use, modify, and develop it further, but proper reference to the original author and FlyBase is required.
The TopoView glyph ("fb_shmiggle.pm", also known as shmiggle) was developed for fast, 3D-like display of RNA-seq data consisting of multiple individual subsets. The main goals were to make the presentation as compact as possible (one reasonably sized track) and to allow easy visual detection of coordinated behavior in the expression profiles of different subsets.
It was found that a log2 conversion dramatically changes how the expression profiles are perceived and brings out the coordinated behavior of different subsets. The glyph and the data indexer/formatter were therefore modified on the assumption that the final data produced by the indexer/formatter will always be a log2 conversion of the original coverage, and so can be represented by short integers with values in the range of roughly 0-200.
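As an illustration of the kind of conversion described above, here is a minimal Python sketch. It is not taken from index_cov_files.pl: the exact formula the indexer uses is not stated here, so the function name log2_short, the +1 offset (so that zero coverage maps to 0), the scale parameter, and the clamp at 200 are all assumptions.

```python
import math


def log2_short(coverage, scale=1.0, max_value=200):
    """Convert a raw coverage count to a small log2-based integer.

    Hypothetical helper: the +1 offset, `scale`, and `max_value` are
    assumptions for illustration, not the indexer's documented behavior.
    """
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    # log2(coverage + 1) keeps zero coverage at 0 and compresses large counts.
    value = int(round(scale * math.log2(coverage + 1)))
    # Clamp so the result always fits the small integer range.
    return min(value, max_value)
```

With scale left at 1.0, coverage values of 1, 7, and 1023 map to 1, 3, and 10, which shows how strongly the conversion compresses the dynamic range.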
When the performance of the wiggle binary method was compared with several possible alternatives (retrieving data profiles several Kbp long for several subsets of an RNA-seq experiment), one approach remarkably outperformed the wiggle binary method, although it requires several times more space to store the formatted data. This optimal storage/retrieval method stores all experiment data (all subsets of the experiment) in one text file. The structure of the file is one of the simplest wiggle (coverage file) formats, with some positioning data added: a two-column format, without run-length specification and without omission of zero values. This is the only format the glyph is able to handle (there are many reasons for that), so any modification of the indexer/formatter _must_ produce an exact equivalent of that format. In my experience, 90% of the debugging with new incoming data was related to problems with that exact format conversion. Example of the formatted data:
# subset=BS107_all_unique chromosome=2LHet
-200000 0
0 0
19955 1
19959 0
19967 2
19972 0
19977 2
20027 0
20031 2
20035 0
20043 1
20045 0
20049 1
20055 0
20062 2
20069 0
20073 2
20082 0
20097 3
20115 0
20125 3
20127 0
20134 3
20139 0
20140 3
20144 0
20145 3
20150 0
20157 3
20162 0
20172 3
20183 0
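Because format problems caused most of the debugging, it can help to read the formatted data back and inspect it before pointing GBrowse at it. The Python sketch below parses the two-column format shown above. It is not part of the distribution: the function names parse_formatted and value_at are hypothetical, whitespace is assumed as the column separator, and the reading of each row as "this value holds from this position until the next listed position" is an assumption about how the glyph interprets the data.

```python
from bisect import bisect_right

# A few rows copied from the example above.
SAMPLE = """# subset=BS107_all_unique chromosome=2LHet
-200000 0
0 0
19955 1
19959 0
19967 2
19972 0
"""


def parse_formatted(text):
    """Return {(subset, chromosome): [(position, value), ...]}."""
    blocks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Header line: "# subset=NAME chromosome=NAME"
            fields = dict(
                item.split("=", 1) for item in line[1:].split() if "=" in item
            )
            key = (fields["subset"], fields["chromosome"])
            current = blocks.setdefault(key, [])
            continue
        if current is None:
            raise ValueError("data row found before any '# subset=...' header")
        position, value = line.split()  # exactly two columns expected
        current.append((int(position), int(value)))
    return blocks


def value_at(points, position):
    """Look up the value at a position, assuming step-wise data (see lead-in)."""
    positions = [p for p, _ in points]
    i = bisect_right(positions, position) - 1
    return points[i][1] if i >= 0 else 0
```

A row with more or fewer than two columns makes parse_formatted raise a ValueError, which is the kind of early failure that is cheaper to hit here than inside a running GBrowse.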
The glyph is supplied with a data indexer/formatter, "index_cov_files.pl", which converts the original coverage (wiggle) files into the data structure used for fast retrieval. Run this script in a separate directory containing the original coverage files (gzipped files work too). When it finishes, the directory will contain two new files: data.cat and index.bdbhash. Both files are required for data retrieval by the glyph. The files can be moved freely between directories or even operating systems (Mac and PC included, I think). Check the contents of the data file carefully if you want to avoid long debugging sessions at the level of a running GBrowse. The files are quite big, but in my experience they take about half the gzipped size of all the initial coverage files, which is quite acceptable.
Example of GBrowse conf file insert (shows actual FlyBase config sections for
Baylor and modENCODE RNA-seq tracks):
[baylor_wiggle]
feature = RNAseq_profile:Baylor
glyph = fb_shmiggle
height = 124
bgcolor = sub {
    my $f= shift;
    $f->{datadir}= '/.data/genomes/dmel/current/rnaseq-gff/baylor/'; # trick it this way..
    my @subsetsorder= qw(
        E2-4hr E2-16hr E2-16hr100 E14-16hr L L3i L3i100 P P3d MA3d FA3d A17d );
    $f->{subsetsorder}= \@subsetsorder;
    return 'lightgrey';
    }
key = Baylor group RNA-seq coverage by subsets (devel.stages) [log2 converted]
category = RNA-seq data
label = ""
title = ""
link = sub {
    my $f= shift;
    my $id= $f->{'id'};
    my $lnk="javascript:void(0);";
    "$lnk\" id=\"$id\" onmouseover=\"showdata_description('Baylor');return false;\" onmouseout=\"delsumm_overlib();";
    }

[celniker_wiggle]
feature = RNAseq_profile:Celniker
glyph = fb_shmiggle
height = 250
bgcolor = sub {
    my $f= shift;
    $f->{datadir}= '/.data/genomes/dmel/current/rnaseq-gff/celniker/'; # trick it this way..
    my @subsetsorder= qw(
        BS40_all_unique BS43_all_unique BS46_all_unique BS49_all_unique
        BS54_all_unique BS55_all_unique BS58_all_unique BS62_all_unique
        BS66_all_unique BS67_all_unique BS71_all_unique BS73_all_unique
        BS107_all_unique BS111_all_unique BS113_all_unique
        BS196_all_unique BS200_all_unique BS203_all_unique
        BS129_all_unique BS133_all_unique BS136_all_unique
        BS137_all_unique BS140_all_unique BS143_all_unique
        BS150_all_unique BS156_all_unique BS162_all_unique
        BS153_all_unique BS159_all_unique BS165_all_unique );
    $f->{subsetsorder}= \@subsetsorder;
    my %subsetsnames= qw(
        BS40_all_unique em0-2hr
        BS43_all_unique em2-4hr
        BS46_all_unique em4-6hr
        BS49_all_unique em6-8hr
        BS54_all_unique em8-10hr
        BS55_all_unique em10-12hr
        BS58_all_unique em12-14hr
        BS62_all_unique em14-16hr
        BS66_all_unique em16-18hr
        BS67_all_unique em18-20hr
        BS71_all_unique em20-22hr
        BS73_all_unique em22-24hr
        BS107_all_unique L1
        BS111_all_unique L2
        BS113_all_unique L3_12hr
        BS196_all_unique L3_PS1-2
        BS200_all_unique L3_PS3-6
        BS203_all_unique L3_PS7-9
        BS129_all_unique WPP
        BS133_all_unique WPP_12hr
        BS136_all_unique WPP_24hr
        BS137_all_unique WPP_2days
        BS140_all_unique WPP_3days
        BS143_all_unique WPP_4days
        BS150_all_unique AdM_Ecl_1days
        BS156_all_unique AdM_Ecl_5days
        BS162_all_unique AdM_Ecl_30days
        BS153_all_unique AdF_Ecl_1days
        BS159_all_unique AdF_Ecl_5days
        BS165_all_unique AdF_Ecl_30days );
    $f->{subsetsnames}= \%subsetsnames;
    return 'lightgrey';
    }
key = modENCODE Transcription Group RNA-seq coverage (unique reads only) by subsets (devel. stages) [log2 converted]
category = RNA-seq data
label = ""
title = ""
link = sub {
    my $f= shift;
    my $id= $f->{'id'};
    my $lnk="javascript:void(0);";
    "$lnk\" id=\"$id\" onmouseover=\"showdata_description('Celniker');return false;\" onmouseout=\"delsumm_overlib();";
    }
In the configuration, it is very important to set the 'datadir' variable (relative to the server DOCUMENT_ROOT) so that the glyph knows where to find the data and the index.
Setting 'subsetsorder' lets you display the expression profiles of the subsets in a predefined order. If this setting is omitted, the glyph displays the subsets in alphabetical order of their initial names.
Setting 'subsetsnames' lets you rename subsets (very important, as in most cases the workflow names of subsets are unsuitable for intelligible data display to end users). If this setting is omitted, the initial subset names are used for display.
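The fallback rules for these two settings can be summarized in a short Python sketch. It only models the behavior described above and is not the glyph's Perl code; the function name display_subsets is hypothetical. How the glyph treats a subset that is present in the data but missing from 'subsetsorder' is not described here, so the sketch simply follows the list as given.

```python
def display_subsets(available, subsetsorder=None, subsetsnames=None):
    """Return (initial_name, display_name) pairs in display order.

    `available` is the collection of subset names found in the data.
    """
    # No explicit order: fall back to alphabetical order of the initial names.
    order = list(subsetsorder) if subsetsorder else sorted(available)
    # No rename table, or no entry for a subset: keep the initial name.
    names = subsetsnames or {}
    return [(subset, names.get(subset, subset)) for subset in order]
```

For example, with available subsets BS43_all_unique and BS40_all_unique and no settings, the result lists BS40_all_unique first and both subsets keep their workflow names. Supplying a rename table that maps BS40_all_unique to em0-2hr changes only that subset's display name.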
For the glyph to be properly activated, you need to insert virtual contig-long features into each of your GFF files for which you have RNA-seq data; these features activate the expression data display. To cover the whole range of the contig (chromosome arm), it is best to use the coordinates given in the 'sequence-region' definition at the top of the GFF file. Examples of such feature lines for FlyBase data are shown below:
2LHet	Baylor	RNAseq_profile	1	368874	.	+	.	Comment=This is a reference feature for RNAseq wiggle tracks
2L	Baylor	RNAseq_profile	1	23011544	.	+	.	Comment=This is a reference feature for RNAseq wiggle tracks
2RHet	Baylor	RNAseq_profile	1	3288763	.	+	.	Comment=This is a reference feature for RNAseq wiggle tracks
2R	Baylor	RNAseq_profile	1	21146708	.	+	.	Comment=This is a reference feature for RNAseq wiggle tracks
3LHet	Baylor	RNAseq_profile	1	2555493	.	+	.	Comment=This is a reference feature for RNAseq wiggle tracks
3L	Baylor	RNAseq_profile	1	24543557	.	+	.	Comment=This is a reference feature for RNAseq wiggle tracks
3RHet	Baylor	RNAseq_profile	1	2517509	.	+	.	Comment=This is a reference feature for RNAseq wiggle tracks
3R	Baylor	RNAseq_profile	1	27905053	.	+	.	Comment=This is a reference feature for RNAseq wiggle tracks
4	Baylor	RNAseq_profile	1	1351857	.	+	.	Comment=This is a reference feature for RNAseq wiggle tracks
XHet	Baylor	RNAseq_profile	1	204113	.	+	.	Comment=This is a reference feature for RNAseq wiggle tracks
X	Baylor	RNAseq_profile	1	22422827	.	+	.	Comment=This is a reference feature for RNAseq wiggle tracks
YHet	Baylor	RNAseq_profile	1	347040	.	+	.	Comment=This is a reference feature for RNAseq wiggle tracks
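Feature lines like the ones above can be derived from the 'sequence-region' definitions rather than typed by hand. The following Python sketch shows one way to do that. It is not part of the distribution: the function name reference_features is hypothetical, and it assumes the GFF3 pragma form "##sequence-region seqid start end" and tab-separated output columns.

```python
def reference_features(gff_text, source="Baylor", feature_type="RNAseq_profile"):
    """Build one contig-long reference feature line per ##sequence-region pragma."""
    comment = "Comment=This is a reference feature for RNAseq wiggle tracks"
    lines = []
    for line in gff_text.splitlines():
        if not line.startswith("##sequence-region"):
            continue
        # Pragma layout assumed: ##sequence-region <seqid> <start> <end>
        _, seqid, start, end = line.split()[:4]
        columns = [seqid, source, feature_type, start, end, ".", "+", ".", comment]
        lines.append("\t".join(columns))
    return lines
```

Calling reference_features on a file header that contains "##sequence-region 2LHet 1 368874" yields the first line of the example above; pass source="Celniker" to produce the matching lines for the modENCODE track.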
Questions about TopoView glyph should be directed to Victor Strelets (strelets@bio.indiana.edu).