######## Bio::ToolBox revision history #############

v1.67
- Add new option of smart coverage to script bam2wig that smartly handles paired-end alignments with gaps (introns)
- Add capability to collect from multiple datasets at once for scripts get_binned_data and get_relative_data. Summary files can now handle multiple datasets.
- Allow specific number of up and down windows in script get_relative_data.
- Add option to provide list of specific feature IDs to script get_features.
- Write shift correlation region data from bam2wig.
- Improve GTF export.
- Add utility function to simplify dataset names, used in data collection scripts. Strips path and everything after first period from dataset file names.
- Improve sort function in manipulate_datasets by taking a range of columns and sorting by mean. Also addname function will overwrite a feature name if present.
- Adjust logic for setting a file extension when none is provided.
- Lots of additional minor fixes and changes

v1.66
- Optimize data2wig fast mode, about 3 times faster
- Summary files now use a cleaned-up column name. Fix bugs with summary file generation.
- Bam2wig now properly reports alignment counts for each strand when provided with multiple input bam files (previously reported the same number).
- Fix bug where the Big adapter would crash when search coordinate was out of bounds, unlike UCSC, HTS, and Sam.
- Improve GTF export with correct formatting and no longer export transcript lines.
- Improve GTF parsing where both transcripts and genes are inferred but coordinates were not updated correctly.

v1.65
- Add function to read directly from bigWig files, and add support for bigWig files to script manipulate_wig
- Added options for filtering transcript Gencode or biotype in script get_gene_regions.
- Added option to discard low count features from script get_datasets.
- Add option to explicitly set number of columns of output bed file in script data2bed
- Update script get_feature_info to work with annotation files
- Optimize data2wig to handle fast option in more scenarios
- Coordinate string generation in manipulate_datasets takes start values as is
- Bug fixes in Bio::ToolBox, get_relative_data, manipulate_datasets, more

v1.64
- Added support for Encode gappedPeak files. Also support for gleaning file formats from bed track lines. This should make new file formats easier to support in the future.
- Fix critical bug with skipping duplicate features from GTF files, particularly from Ensembl where exons share the same exon ID.
- Fix double-counting of stranded alignments in bam2wig script. Also correctly set minimum paired-end size.
- Fix bug to correctly count FPKM and TPM over length-adjusted features in script get_datasets.
- Fix bug with filtering transcripts in script get_features.
- Reset and clarify behavior regarding stop codons when parsing and exporting transcript features for various annotation formats.
- Add single-letter option support to script get_gene_regions.

v1.63
- Added minimal Cram file support through the HTS adapter. Currently only supports the reference fasta listed in the Cram file header.
- Added fast paired-end option and paired-end start point options to script bam2wig. Temporary files now written to a temporary subdirectory, which can be specified. Extreme depth can now be handled properly by using 32 bit integers instead of 16. Splice segments can now be fractionally counted.
- Brought back and updated old script correlate_position_data to identify positional shifts in nucleosome or ChIP signal peaks.
- Added new SeqFeature methods to duplicate objects and delete subfeatures.
- Added option to format result numbers in script get_datasets.
- Fix numerous small bugs in scripts data2gff, data2fasta, get_intersecting_features, get_relative_data, and more

v1.62
- Added Bed parser with support for bed3-12, bedgraph, narrowPeak, and broadPeak files. Data collection files will now parse bed files and write a table with ID and name only, instead of appending data columns to the original file structure. Parsing can be turned off if you prefer the old way.
- Added support for writing bed12 transcript models to GeneTools library and get_features script.
- Bam file alignment counting now automatically excludes all secondary, duplicate, and supplementary marked alignments.
- Add new method to manipulate_datasets to name features, useful for naming bed3 files.
- Added TPM option to get_datasets script.
- Fix bugs with parsing gff and gtf files at same time
- Fix bugs with detecting null and/or empty values, especially when converting data formats
- other miscellaneous bug fixes

v1.61
- Added genomic sort and bgzip file compression support when writing files for tabix compatibility with several scripts, including those that write gene tables.
- Tables generated from parsed gene annotation files (GFF, etc) no longer write a Type column.
- Simplified dataset column names in script get_datasets.
- Fix transcript filtering bugs in script get_features.
- Add helper methods for setting bam and big adapters to db_helper.
- Optimized run time by loading db_helper only on demand.
- Fix numerous POD bugs.

v1.60
- Major update to using Bio::DB::Big module for bigWig and bigBed file support. This should be much easier to install and support than the old UCSC library adapter modules from GMOD. The old UCSC adapter is still supported, however. Also included a wrapper for working with BigWigSet databases, which are too useful to deprecate.
- Use File::Which to always locate helper applications.
- Add support for pigz when writing gzip compressed files.
- Add support for fetching genomic sequence from subfeatures
- Add single letter command line options to all scripts. This was vaguely inherently supported before if the option was unique, but now single letters (case sensitive) for common options are explicit, and bundling is available.
- Add simple menu descriptions and option grouping to the Synopsis section of every script POD documentation (about time!).
- Add new script manipulate_wig.pl.
- Add chromosome-specific normalization to bam2wig.

v1.55
- Fix bugs in bam2wig script when using negative shift values; thanks to Piotr for reporting. Also fix bug regarding forking in coverage mode; thanks to Naoki for reporting.

v1.54
- Update config module to stop writing unnecessary config files. Config file will only be written when updating database or application paths. Removed outdated validation, exclude tags, feature classes, and default window values used by old db_helper methods.
- Complete rewrite of get_features script to handle annotation files such as GFF3/GTF/UCSC formats in addition to SeqFeature::Lite databases. Includes additional feature filters based on tags.
- Add additional transcript filter methods to GeneTools library, including GENCODE basic tags and transcript_biotype.
- Update Data parse_table API, now allows for chromosome skip regex, control simplify option, and explicitly search for mRNAs.
- Allow SeqFeature transcript collapsing and length determination to work with features from a database.
- Tolerate weird transcript types when collecting subfeatures in various GeneTools functions.
- Removed unnecessary primary_tag gene checks when collecting scores.
- Record extra ensemblSource data as transcript biotype when parsing UCSC files
- Add chromosome skip regex to db_helper and big_helper methods
- Add no header options to data convertor scripts
- Long overdue update of POD and Readme.

v1.53
- Significantly streamlined GTF and GFF3 parsing to improve loading times.
  By default, no subfeatures are parsed and must be explicitly turned on as needed.
- Improved parsing gene tables (GTF, UCSC, etc) as an input file to scripts. Now supports defining both the feature and subfeature types to process. One more reason not to use an annotation database.
- Fixed critical bug with collecting data across subfeatures, e.g. get_binned_data. Subfeatures were not being properly parsed and coordinates weren't converted to relative positions correctly. Thanks to Zhizhou for reporting.
- New methods in Data objects for collapsing gene transcripts and calculating transcript lengths.
- Fix bug with paired-end center span recording in bam2wig. Thanks to Yixuan for reporting.
- Summary files now report bin midpoints based on 1000 bp length.
- Script pull_features allows multiple groups in a list file, and can write only summary files if desired.
- Bug fix in collecting sequence. Thanks to Patrick.
- Add support for collecting cds Start and Stop in script get_gene_regions
- Numerous small bug fixes

v1.52
- Added binning option to wig files in script bam2wig. Default is to write wig files in 10 bp bins with significant decreases in runtime and memory usage while not appreciably diminishing resolution.
- Add support to calculate shift values without doing wig conversion in script bam2wig
- Add support for mRNA transcript subfeatures, including CDS, 5 prime UTR, and 3 prime UTRs, in data collection scripts get_datasets and get_binned_data.
- Add new UTR methods to GeneTools library
- Changed behavior of reporting common and alternate exons and introns in GeneTools. Genes with single transcripts now report all exons and introns as common for simplicity.
- Add option to search at the 5 prime, middle, or 3 prime end of features in script get_intersecting_features
- Fix bug in specifying which database feature to collect regions from in script get_gene_regions
- Fix bug where tables with coordinates could not be used in database lookups in script get_feature_info

v1.51
- Changed how bam alignments are recorded for indexed position data hashes. Alignments are now recorded at their 5' position instead of midpoint, which wreaked havoc with large gaps and pairs.
- Reporting indexed bam alignment names (ncount method) now returns the actual names rather than count. The db_helper calculate_score method can properly count these. This avoids double-counting across exons, etc.
- Fix major bug in script bam2wig that prevented paired-end alignments from working. Thanks to Mengyao for pointing this out.
- Add additional checks when loading malformed files that have a missing column header or extraneous hidden columns (extra tabs)
- Add format checks for numeric columns in some file formats
- Miscellaneous code improvements here and there

v1.50
- Major upgrade of the data collection libraries to simplify data collection and improve efficiency. The value type is no longer specified, being rolled into the specified collection method. Low level optimizations have been added to improve speed. Increases from 30% to over 300% have been measured, depending on the collection method and adapter.
- Rewrite of data collection scripts to work with the improved libraries
- Added support for the modern Bio::DB::HTS module for Bam files, while keeping support for the older Bio::DB::Sam module.
- Added more agnostic support for multiple different fasta indexing adapters
- Script bam2wig is completely rewritten to handle multiple bam files for merging, independent bam scaling, improved alignment filtering, customizable output, improved cross-strand correlation for peak shifting, improved speed and memory management, and lots more features.
- Updated script data2fasta
- Numerous other features and changes too small to mention
- Relaxed requirements for external modules, namely BioPerl, so that scripts and functions that don't absolutely require them can still be used. All database functions will require it though.

v1.45
- Fix endless loop bug with opening files with metadata but no data, e.g. empty VCF files
- Revert support for opening bedGraphToBigWig file handles

v1.44
- Added new function to GeneTools for exporting to GTF format.
- Added new function to filter transcript subfeatures in a gene SeqFeature object by available Ensembl Transcript Support Level tags.
- Fixed critical bug with collapsing multiple transcripts in GeneTools function that resulted in too many overlapping exons.
- Fixed bug in exporting non-coding gene models to UCSC refFlat format.
- Other minor bug fixes.

v1.43
- Fix bug with unique option in script get_gene_regions where too many regions were being discarded. Thanks to Mengyao.
- Fix bug with generating bigWig files in script bam2wig, and restore option to prefer bedGraphToBigWig if so desired
- Add option to ignore extraneous attribute tags when parsing GFF and GTF files to reduce memory (simplify). Enable this option by default when parsing annotation files when loading a table in Bio::ToolBox::Data.

v1.42
- Changed bigWig convertor method to use primarily the wigToBigWig utility for simplicity
- Introduced new method to open a wigToBigWig utility filehandle to "print" wig files directly to a bigWig
- Updated bam2wig and data2wig scripts to write directly to the bigWig utility and skip writing temporary intermediate wig file
- Added functionality to bam2wig to record stranded shifted counts
- Fixed a critical bug in script get_gene_regions where transcripts weren't being filtered
- Improved file format taste testing to avoid GFF false positives
- Improved UCSC gene table parser behavior

v1.41
- Added no header option when loading text files missing a column header row.
  Updated script manipulate_datasets to take advantage of the feature.
- Added option to combine multiple score columns into a single score when converting a file to a wig file in script data2wig
- Added option to split gff or vcf data files by an attribute tag in script split_data_file
- Improve handling of writing vcf files
- Fix critical errors with calculating cdsStart and cdsEnd in the GeneTools library
- Fix bugs in gff parser to continue when encountering errors in parsing and interpret transcript biotype gtf attributes
- Fix bug in properly handling start coordinates in script data2wig

v1.40
- Major update introduces new SeqFeature object Bio::ToolBox::SeqFeature that is a little faster and more compact than equivalent BioPerl objects. This is the default object used in gene table parsers.
- New Module Bio::ToolBox::GeneTools for working with SeqFeature objects representing traditional nested feature gene, transcript, exon models. The script get_gene_regions now uses this module, as do other scripts.
- Expunged many scripts that are no longer considered part of the primary mission of the BioToolBox distribution. These are now available in a separate repository located at https://github.com/tjparnell/HCI-Scripts.
- Bio::ToolBox::Data objects can now parse all gene tables into memory and store the features in the object. This allows gene tables to be used without requiring a database to be setup.
- Added a file tasting method to determine whether a file looks like a specific file format, e.g. gff, UCSC gene table, etc.
- Added numerous little methods and method aliases here and there to improve functionality
- Added attribute rewrite functions for both GFF and VCF files
- Improved file format testing
- Numerous little optimizations in loading files

v1.36 (git 44b9dea)
- added new option to script get_relative_data to allow user to specify what feature types to avoid
- fix bugs in scripts manipulate_datasets when exporting log2 treeview files and defining x axes in graph_profile
- fix annoying bug where manipulate_datasets will not re-show column list
- improve data file summarization
- some library method optimizations

v1.35 (git e489d52)
- Add new options for setting dimensions and linear regression lines in script graph_data.
- Restored unique option in script data2gff.
- New convenience methods for Feature objects.
- Fixed bug with smoothing interpolation in get_relative_data
- Numerous other bug fixes regarding bed files, column names, file support, warnings.

v1.34 (git 5d4803c)
- Changed the behavior of automatically converting interbase coordinates to base coordinates upon loading a file, and converting back as necessary when writing. This had the side effect of effectively changing coordinates when writing out nonstandard text files. Conversion is now done on the fly when using the start method of row Features. Start interbase coordinates are now recognized by appending a 0 to the column name. Output files should now look like the input files.
- Strand values are not automatically converted upon loading; They are converted as necessary on the fly using the row Feature strand method.
- Null values are not automatically converted to internal '.' null values. They are converted as necessary using the row Feature value method to maintain backward compatibility.
- Scripts data2bed and data2wig go back to using a Stream input to avoid high memory usage.
- Script data2wig now has a fast option to skip lots of checks on values and intervals.
  This speeds up conversion considerably at the risk of making improper wig files if the source file has issues.
- Script join_data_file is considerably faster by simply concatenating data lines without processing or checking.
- Script bam2wig has new recording option, mid extend, to record the middle portion of alignments or proper paired-end alignments. Credit to Ohad for recommending.
- Add explicit interbase support to scripts data2gff and data2fasta.
- Fix critical bug where extensions were not scored properly for coordinate features in script get_binned_data. Thanks to Mengyao.
- Fix bam2wig alignment illustrations in POD. Thanks to Ohad.
- Bug fixes regarding bed file integrity checking that were introduced in the previous release.

v.1.33 (git ba1a70e)
- Removed legacy_helper module. All scripts now properly updated to use Bio::ToolBox::Data and related objects. This was the last step of a long process to modernize all of the scripts to use the new libraries.
- All data collection modules are now chromosome naming-scheme agnostic, meaning that "chr1" and "1" for chromosome can be used equally, regardless of what the annotation or big data file uses.
- Minimal VCF file support is added, including the ability to parse INFO and SAMPLE attributes, and verify some file format integrity.
- Significantly improve GTF file parsing.
- Improve file format verification, including printing error messages. This should alleviate cryptic reasons for automatic file extension changes.
- Tons of bug fixes. See GitHub for a full change log.

v.1.32 (git 67749a7)
- Fix bug with adding a new column to Data object, particularly when selected from a database.
- Fix bugs related to adding, deleting, or modifying columns for a specific file format, such as BED or GFF
- Introduce additional Data structure verification tests, including proper strand information, to verify correct file formatting, such as BED and GFF
- Fix bugs when writing data files that incorrectly maintained file extensions for a given format even when the structure was no longer valid.
- Add support for .bigwig and .bigbed file extensions.
- Fix bug with opening fai fasta index and forked databases in script CpG_calculator.

v.1.31 (git 9a4e122)
- Major addition of parsers for GFF and UCSC gene table formats. This replaces the old gff3_parser and now supports GFF, GTF, and GFF3. Also moved UCSC gene table parsing out of ucsc_table2gff3 and into own parser module, available for all. This supports refFlat, genePred, and knownGene tables. Tests for these parsers are included.
- Updated script get_gene_regions to use parsers.
- Greatly optimized bedGraph writing from script bam2wig to reduce memory usage. Also ensure that bedGraph is written over entire chromosome.
- Fix bugs when sorting and performing math with null, NA, and inf values, especially with script manipulate_datasets.
- Fix bug where coverage shifts by 1 bp after each write to fixedStep wig in script bam2wig. Thanks to Magda for reporting.

v.1.30 (git 9ab9ff4)
- Major upgrade of the Bio::ToolBox::Data library internals. Old data_helper and file_helper modules are gone, and a legacy_helper module added for those programs that still haven't been upgraded yet. Numerous improvements and bug fixes to Data and Stream objects, structure verification, standard file format metadata, file writing, and more. Several new methods have been added too.
- Added support for ncount, or name count, of bam files. By counting unique alignment names, we can avoid double-counting of reads in adjacent search areas. Also works for counting paired-end reads. Supported by get_datasets script.
- Updated pull_features script to use new Data objects.

v.1.26 (git 21c800b)
- Removed Extras folder and outdated library functions. These are available as a separate GitHub project, biotoolbox-extra.
- Improved GFF3 parser to handle orphans more gracefully, and simplify parsing by adding a next_top_feature function. It is moved out of the db_helper hierarchy, where it never really belonged.
- Changed license to exclusively Artistic License 2.0.
- Fixed bug when using input files with coordinate information in script get_datasets. Thanks to Mengyao for reporting.
- Fixed bug when opening a new Data::Stream not based on a file or data list.

v.1.25 (svn 955)
- Added a new option to manually specify the extension length and allow new ways to record read coverage in the script bam2wig.pl. A text graphic is included in the documentation to illustrate different methods.
- Broke out database and fasta functionality from Bio::ToolBox::db_helper into a separate sub module, which should limit the number of modules loaded at compile time.
- Allow main Data feature_type to be specified by command line option, useful when your input file has names of database features but not a type column, for scripts get_feature_info.pl, get_datasets.pl, get_binned_data.pl, and get_relative_data.pl.
- Added BED and GFF string export to Bio::ToolBox::Data::Feature objects.
- Changed library version reporting for default new Data files.
- Fix bugs with setting and removing AUTO metadata properly when opening and writing Data files.
- Fix bugs regarding deleting metadata, which had a side effect of adding unwanted metadata to files written by manipulate_datasets.
- Added more name possibilities when looking for possible name columns.
- Fix bug where a database may sometimes not be opened properly after forking into children in data collection scripts.
- Fix bug that prevented statistics from being recovered from child processes in script graph_data.pl.

v.1.24001 (svn 940)
- Updated tests to catch possible sources of error, including recent UCSC BigFile libraries that power Bio::DB::BigWig adaptors, DB_File required for GFF3 loading into memory database, and path verification in Data metadata.

v.1.24 (svn 936)
- Added new module Bio::ToolBox::Data::Stream for working with data files line by line instead of loading them into memory. Moved lots of shared methods into Bio::ToolBox::Data::common.
- Added explicit file support for UCSC-style refSeq and genePred file formats, as well as Encode narrowPeak and broadPeak files.
- Added new value type, pcount, in data collection scripts and library score methods. Features, such as Bam alignments, must be entirely contained within the search region, and not just overlapping as with the count value.
- Added improved method for reloading forked children files back into Data objects without having to call external join_data_file script.
- Improved forking in data collection scripts, including a delay in the parent after forking to prevent race conditions on fast servers with high fork numbers.
- Removed all vanity names to data_helper and file_helper subroutines. All scripts updated to reflect changes.
- Improved identification of overlapping features when avoiding neighboring features when collecting relative data.
- Optimized Bam score data collection methods.
- Disabling bins when writing coverage in bam2wig.
- Fix bugs with writing CDT files in manipulate_datasets.
- Improved ToolBox::Data::Feature methods to handle internal nulls.
- Improved retrieval of sequence list, particularly for SeqFeature::Store databases.
- Updated and improved library testing for Data and Stream objects and database interaction.
- Fixed bug where negative coordinates would not be accepted when collecting relative coordinates.
- Fixed bug where Bam and BigBed databases may not be opened properly in some instances, such as precounting features for RPM scores.
- Fix bug where in some cases all database features could be returned with the method get_feature().
- Fix bug where the type option was not properly implemented in script get_feature_info.
- Fix bug limiting to chromosome length in script get_intersecting_features.

v.1.23 (svn 915)
- Improved script get_gene_regions to recognize non_coding exons; prompt for region, feature, and RNA type; specify for more than one feature type at a time; and avoid mixing RNA sub types from the same gene. Thanks to Mengyao for troubleshooting.
- Fixed bugs pertaining to collecting relative windows that may extend beyond the beginning of the chromosome. Thanks to Nate for reporting.
- Fixed bugs sorting by genomic coordinate, especially when only Position is provided and not Start.
- Made Bio::ToolBox::Features return smart coordinates only, no funny values.

v.1.22 (svn 906)
- Added new export options of alternate, common, or all exons to script get_gene_regions.
- Changed behavior of Bio::ToolBox::Data::Feature such that database features must now be explicitly retrieved rather than automatically retrieved, which could lead to runaway execution if it could not be found.
- Improved how name columns are recognized and used when retrieving database features.
- Improved writing of strand information in proper format for Bed and GFF files.
- Fixed numerous bugs that prevented proper execution in several scripts, including manipulate_datasets, get_feature_info, graphing scripts. Thanks to Mengyao and Yixuan for reporting.
- Standardize data file loading message among several scripts.

v.1.21 (svn 896)
- Fixed critical bug that prevented upstream windows from collecting data in script get_relative_data.
- Fixed critical bug that prevented some bigBed files from being opened.
- Fixed critical bugs that prevented scripts data2fasta and get_intersecting_features from working properly.
- Fixed bugs where strand may be inappropriately assigned or sometimes ignored when collecting regional positioned scores.
- Fix minor bugs in output of scripts ucsc_table2gff3 and get_ensembl_data
- Include checks in data collection scripts to exit gracefully if datasets can't be verified.
- Interactive list of values to keep or toss is now sorted alphanumerically in script manipulate_datasets.

v.1.20 (svn 884)
- Refactored db_helper so that all database adaptors are loaded dynamically only as needed during runtime, rather than loading everything all at once regardless of need. This results in faster load times and reduced memory footprint.
- Added new methods to Bio::ToolBox::Data objects, including sorting, genomic sorting, and feature_type.
- Split out metadata-related methods and Feature objects as separate modules in Bio::ToolBox::Data. Feature objects will now automatically retrieve represented database features as necessary to collect attributes.
- Rewrote many, many scripts to use Bio::ToolBox::Data objects. Simplify, unify, and improve all Data functions.
- Moved many specialized, outdated, or esoteric scripts to an optional extras folder that will no longer be distributed via CPAN but will be available through SVN.
- Added new functions to script manipulate_datasets.pl, including processing rows with specific values, split and concatenate columns, view table contents, and add additional manipulations prior to writing CDT files. Also, several old functions were removed.
- Added support for converting refFlat and simple genePred file formats to GFF3 in script ucsc_table2gff3.pl.
- Add better warnings for reading files with DOS or MAC line endings.
- Removed file extension manipulation in join_data_file script.
- Replaced fatal errors with warnings in merge_datasets script.
- Fix critical error where midpoints were not calculated correctly for features in script get_relative_data.pl, preventing data collection around a feature midpoint.
- Fix bug to properly collect extended bins at 3' end and avoid undefined start errors in average_gene.pl; plus write a summary file when executing with forks.
- Fix bugs with collecting features from a database.
- Fix bug with renaming M to UCSC-style chrMT in get_ensembl_annotation.
- Numerous other small fixes scattered about.

v.1.19 (svn 843)
- Implemented subfeature sharing and multiple parentage when exporting UCSC tables as GFF3. For example, exons can now be shared between multiple transcripts of the same gene. This leads to considerable reduction in file size at the expense of increased complexity. Naming of subfeatures is now optional.
- Renamed script print_feature_types.pl to simply db_types.pl. Known databases in the configuration file can now be interactively chosen from a list.
- Added support for multiple parentage in the gff3 parser library and script gff3_to_ucsc_table.pl.
- Added a verbose option and improved path detection in script db_setup.pl.
- Script filter_bam.pl now works on unsorted and non-indexed bam files, making it more useful than before.
- Bam files opened using db_helper::bam may now be sorted as necessary before indexing.
- Increase default buffer value in script bam2wig.pl.
- Fixed bug where firstExon features were misnamed as lastExon in script get_gene_regions.pl.

v.1.18 (svn 826)
- Fixed critical bug when calculating RPM and RPKM values in data collection scripts. This is a long-standing bug that produced erroneous values. The bug does not affect bam2wig.pl rpm reporting.
- Improved methods for collecting from subfeatures such as exons of genes or transcripts in script get_datasets.pl.
- Added option to specify which UCSC table(s) to use when setting up a new database in script db_setup.pl.
- Added new options to extend and concatenate sequences in script data2fasta.pl.
- Added ability to use the samtools fasta index when available in scripts data2fasta.pl and CpG_calculator.pl.
  This index is about 10-20% faster than the BioPerl fasta index.
- Fixed bug to avoid illegal characters in filenames when splitting data files, and added an option to use a custom file prefix in script split_data_file.pl.
- Fixed bug where ensembl gene names may not be properly recorded in the output GFF3 file in script ucsc_table2gff3.pl.

v.1.17 (svn 808)
- Added six new method functions to Bio::ToolBox::Data for working with columns and metadata.
- Updated script correlate_position_data.pl with parallel execution plus an ANOVA statistical analysis between data.
- Fixed bug where the --bwapp option was not being used in script bam2wig.pl. Thanks to Michael D. for reporting.
- Removed extraneous BioPerl warnings when opening a fasta file or directory fails, and replaced with some suggestions.
- Fixed bug with RPM option that led to warnings in db_helper.
- Simplified warning for duplicate lookup values in script merge_datasets.pl.
- Reorganized the POD summary and provided examples of usage for main data collection scripts, plus provide default values in POD summaries for a number of scripts. Thanks to Christian for the recommendation.

v.1.16 (svn 794)
- Fixed critical bug that prevented the forward strand from being written when generating stranded coverage in script bam2wig.pl. Thanks to Michael D. for reporting.
- Fixed critical bug that prevented the script get_bam_seq_stats.pl from compiling properly.
- Fixed bug that prevented filtering more than one length at a time in script filter_bam.pl. Thanks to Yixuan for reporting.
- Fixed again the bug where passing a negative or zero start to data collection methods issues a warning and resets the value to 1 in db_helper.

v.1.15 (svn 786)
- Added Bio::ToolBox::Data method to delete column metadata and improved adding new metadata.
- Added back cached database objects for data collection, which brings back speed lost in the previous version.
- Original strand format is now maintained when rewriting data files.
For example, + and - from Bed and GFF files as opposed to 1 and -1. - Passing a negative or zero start value to data collection methods in db_helper now issues a friendly warning and resets the value to 1. - Opening a BigWigSet directory of bigWig files can now infer strand based on filename and set the metadata appropriately. For example, files whose basename ends in f, forward, or plus will be interpreted as strand 1. - Script gff3_to_ucsc_table.pl was significantly updated to address critical flaws and change the output format to refFlat. - Script manipulate_datasets.pl no longer writes metadata for simple file formats when using certain functions that do not change data content. - Script bam2wig.pl now includes a --flip strand option. - Fixed errors and made improvements regarding fonts and sizes in scripts graph_data.pl and graph_profile.pl. - Various other small bug fixes and checks for optional Perl module installs. - Updated shebang lines to use universal /usr/bin/perl - Updated script POD documentation to make common options more uniform. v.1.14.1 (svn 763) - Changed the method of caching database objects introduced in version 1.14, which wreaked havoc with forked child processes. All database connections are cached by default and returned if subsequently re-opened, unless explicitly told to not use the cached connection. Multiple scripts were updated to reflect the new connection caching. - Bio::ToolBox::Data now automatically re-clones existing database connections if you splice the data table. - Bam file index files are now explicitly generated prior to opening the bam file database connection. Additionally, existing .bai files are copied as .bam.bai in preference to creating a new .bam.bai file. Thanks to Yixuan for reporting. - Fixed POD errors in script bar2wig.pl and updated method for finding the java executable file. Thanks to Guillaume for reporting. - Removed debugging warn statements in script get_relative_data.pl.
- Added POD documentation to Bio::ToolBox::db_helper::useq. v.1.14 (svn 737) - Massive reorganization of the entire package into a proper Perl module distribution that is installed using standard Module::Build methods. This will install the libraries into site-specific Perl library directories as Bio::ToolBox::*. Scripts will install into a standard bin directory. All scripts have been updated to reflect these changes. - Added new module Bio::ToolBox::Data, which provides an easy object-oriented interface to working with data files and the rest of the Bio::ToolBox functions. - Added new script db_setup.pl to ease generating an annotation database with UCSC data - Added Build tests for all major library functions, including score collections from all binary database adaptors. - Added capability to properly collect value types, including score, count, and length, from useq and wiggle database adaptors - Loosened restriction for counting Bam alignments where the midpoint had to be within the query region; now any overlapping alignment that intersects the region will be counted. - Reworked the interpolation algorithm to interpolate as many datapoints as possible in script get_relative_data.pl. - Removed cryptic error messages when opening databases, and added database handle caching to avoid repeated openings - Newly generated feature lists no longer append all aliases to the feature name - Added additional attributes to the list of available ones to retrieve from the database in script get_feature_info.pl. Also added a --type command line option to set a feature type to named features. - Improved data table checking to include a count of columns for every row. - Added max_count option to script bam2wig.pl to control for high Bam coverage - Fixed bug where the summary file was not created for script get_relative_data.pl v.1.13 (svn 691) - Updated to include native support for USeq archive files with data collection scripts. 
USeq files may be used in the same manner as BigWig, BigBed, or Bam files for data collection. USeq files may be generated using tools from the USeq package (useq.sourceforge.net). The Bio::DB::USeq adaptor is available via CPAN. - Added new script filter_bam.pl, which can filter alignments based on various criteria and write a new Bam file. Filters are one or more boolean tests, including attributes, scores, lengths, sequence, etc. - Added new script get_bam_seq_stats.pl, which collects information about the read sequences themselves and summarizes the sequence composition and nucleotide frequencies, suitable for generating sequence logos. - Updated script manipulate_datasets.pl to allow any integer to be used when formatting decimal values. - Restored ability to write a new data file without collecting data from script get_datasets.pl. - Changed the log conversion step to avoid having to increase read count by 1 to avoid log of 0 errors in script bam2wig.pl. - Use the command line --log argument in preference over metadata in script manipulate_datasets.pl. - Method sum now writes 0 instead of null in script bin_genomic_data.pl. - Fixed issue where joining data files may not maintain gzip status. This had issues with combining forked children files. - Fixed bug where a provided, indexed data source file (e.g. BigWig) could not be used as a database in script get_datasets.pl v.1.12.6 (svn 680) - Updated the script novo_wrapper.pl to use Parallel::ForkManager instead of GNU Parallel. This should make it more stable, particularly under nohup. - Consolidated the standard out results when functions were applied to multiple columns in script manipulate_datasets.pl. This will make the script much less chatty. - Fixed bug with naming temporary forked children file names. - Fixed bugs with the generation of summary files. - Fixed bug with the automatic identification of the X axis in script graph_profile.pl. 
- Fixed bug where features not found in a database could crash the script get_feature_info.pl. v.1.12.5 (svn 667) - Improved the shift value determination to make it more robust against outliers in script bam2wig.pl. Additionally, the model data that is written is now centered over the shift peak to make evaluations more interpretable. - Fixed a bug where 0 or negative coordinates may be written to varStep wig files in script bam2wig.pl. v.1.12.4 (svn 662) - Improved the efficiency of scanning for high coverage regions and calculating 3 prime shift values in script bam2wig.pl; each reference sequence is now scanned in parallel. Also added a new option to write the shift profile model and correlation data. The efficiency of writing bedGraph files was improved, giving up to a 2X increase in performance. The default maximum duplicate value is now unlimited. Warnings about coverage beyond the ends of chromosomes are now silenced unless verbose is turned on. - The script graph_data.pl can now execute in parallel to improve efficiency when a list of datasets is provided in advance. A list may now be provided in conjunction with the --all option. - Improved recognition of the X-axis column in script graph_profile.pl. - Fixed critical error when writing extended position bedGraph files from script bam2wig.pl where reverse reads were not extended appropriately in the 3 prime direction. v.1.12.3 (svn 651) - Added user options to control the size of the memory buffer when writing bedGraph files and the disk write frequency in script bam2wig.pl. - Added option to control the output order of the features from script pull_features.pl. The order may match either the input list or input data file. Also improved automatic column identification and avoid empty output files. - Script data2wig.pl will now write bedGraph files. - Fixed bug leading to excessive memory usage when writing a fixedStep wig file from script bam2wig.pl. Thanks to Jeff for reporting.
- Fixed bug where strand values for gff or bed files may not be written correctly. - Fixed bug leading to errors loading input files with comment or empty lines in the middle of data lines. - Fixed bug to avoid log of 0 errors in script bam2wig.pl. v.1.12.2 (svn 642) - Scripts find_enriched_regions.pl and CpG_calculator.pl are now multi-threaded. The find_enriched_regions.pl script also has additional optimizations to reduce memory usage. - The script merge_datasets.pl now has the option to use a coordinate string as a unique identifier when looking up features. This is particularly helpful with BED, GFF, and other files with genomic coordinates that do not have unique name identifiers. - A coordinate string in the format chromo:start-stop may now be generated from coordinate values in data files using a new function in the script manipulate_datasets.pl. - Fixed a bug regarding changing file extensions in script join_data_file.pl, which gave odd output file names with scripts that executed in parallel. v.1.12.1 (svn 635) - Fixed bugs where gzip status and file extensions may be inappropriately inherited. This may cause problems when joining children files from parallel process forks. - Fixed bug where the interactive menu would exit upon an empty value in script manipulate_datasets.pl. A "q" must now be provided to exit. - Minor optimization when calculating shift values in script bam2wig.pl. v.1.12 (svn 619) - Major improvements to performance of some data collection scripts by adding multi-threaded options. These include get_datasets.pl, get_relative_data.pl, average_gene.pl, and bam2wig.pl. The number of CPU forks may be specified with the --cpu option (default 2). This option requires the installation of Parallel::ForkManager, available through CPAN. Run the check_dependencies.pl script to install it. - All gzip compression read and writes are now forked through an external gzip utility for a considerable boost in performance (2-5X).
The gzip executable must be in your path for this to work (it usually is on most Unix-like environments). - Added --long option when collecting data from long features in script average_gene.pl. - Improved efficiency when collecting data from very large windows in both get_relative_data.pl and average_gene.pl. - Summing the total number of read alignments in Bam files is also multi-threaded. Summing the total number of intervals in a BigBed file is also improved. - Fixed a critical error where not all windows had data collected when using the script get_relative_data.pl v.1.11 (svn 603) - Major revision of how features are now retrieved from the database using primary_IDs rather than relying on unique names in the database. Generating lists of features will now return Primary_ID, Name, and Type. The Primary_ID is unique to a database and is usually non-portable. Current feature lists with only Name and Type will still work, and are subject to limitations of non-unique Names in the database. This affects all scripts that work with database features, including get_features.pl, get_feature_info.pl, get_datasets.pl, get_relative_data.pl, average_gene.pl, get_intersecting_features.pl, and correlate_position_data.pl. - GFF3 annotation scripts get_ensembl_annotation.pl and ucsc_table2gff3.pl now produce GFF3 files that better match the GFF3 specification. Names are no longer made unique (which broke ties with the originating data), proper Dbxref tags are attributed when external sources could be identified, and chromosomes are now sorted by name. Other minor improvements were also made. - Fixed critical bug that prevented spliced alignments from being counted in script bam2wig.pl. Thanks to Pinal K. for reporting. v.1.10.3 (svn 597) - Unified column names and improved their recognition in scripts get_feature_info.pl and the graphing scripts graph_data.pl, graph_histogram.pl, and graph_profile.pl. 
- Graphing scripts now write the output graph directory in the input file parent directory instead of the current directory. v.1.10.2 (svn 591) - Added a new option of position when adjusting coordinates of retrieved features using the script get_features.pl. Coordinates may be adjusted at the 5 prime, 3 prime, or both ends of stranded features. This also fixes bugs where collected features on the reverse strand with adjusted coordinates were not reported properly. - Improved automatic recognition of the name, score, and other columns in the convertor scripts data2bed.pl, data2gff.pl, and data2wig.pl. - Improved the Cluster and Treeview export function in script manipulate_datasets.pl. The CDT files generated now include separate ID and NAME columns per the specification, and new manipulations are included prior to exporting, including percentile rank and log2. - The convert null function now also converts zero values if requested in script manipulate_datasets.pl. - Added new option of a minimum size when trimming windows in the script find_enriched_regions.pl. - Increased the radius from 35 bp to 50 bp when verifying a putative mapped nucleosome in script map_nucleosomes.pl, leading to fewer overlapping or offset nucleosomes. - Added new option to re-center offset nucleosomes in script verify_nucleosome_mapping.pl. Also improved report formatting. - Added checks and warnings when writing file names longer than 256 characters. Some scripts automatically generate file names that may exceed this limit, preventing writing. File names are now truncated. Thanks to Adam F. for reporting. - Added new methods and code improvements to the gff3 parsing library. - Fixed a bug in script merge_datasets.pl where the column index for a second file may not be properly validated leading to premature termination. - Fixed a bug where multiple datasets combined with an ampersand for merging were not properly verified. 
- Fixed a bug where a user may not be prompted to select a dataset from a database if none was supplied from the command line. - Fixed a bug where files containing trailing nulls do not load properly. - Fixed a bug related to finding specific data columns by name. - Fixed a bug with writing summary files. v.1.10.1 (svn 568) - Added support for Bio::DB::Fasta in the main BioToolBox library, and added the support to scripts data2fasta.pl and CpG_calculator.pl. Any BioToolBox program that requires chromosome information or sequence can now use a genomic multi-fasta or directory of fasta files in the --db option. - Fixed critical error in data2gff.pl that prevented files from being converted to GFF format. - Fixed critical error in merge_datasets.pl that prevented column headers from being written to the output file. - Made the warning about unavailable files on the UCSC FTP server less scary in the script ucsc_table2gff3.pl. - Updated and clarified some script documentation. v.1.10 (svn 559) - Significantly improved performance when collecting data from Bam files by using a low level API. Improvements of at least 2X may be realized. - Significantly improved the performance of the bam2wig.pl script by at least 2X. Added a new option of recording extended regions across the predicted fragment based on empirically determined shift values. Sampling to determine shift values has been increased. BedGraph files are now written more efficiently. A maximum number of identical reads is now enforced. - Significantly improved the performance of the split_bam_by_isize.pl script to increase speed by at least 2X. Added an option to skip checking of mates. Improved reporting of results. - Added a filter option to remove overlapping nucleosomes in script verify_nucleosome_mapping.pl; also fixed bugs in reporting offset distances and improved output reporting. - Removed confusing separate scan and tag datasets required for script map_nucleosomes.pl. Cleaned up and organized code.
Fixed bugs that prevented datasets from being validated. - Fixed critical bug where data was not collected for the final row in script get_datasets.pl. - Fixed bugs with parsing unusual input files, for example commented header lines in bed files or inconsistent column numbers. - Fixed bug in script get_intersecting_features.pl where a strand column was expected even if it was not present. - Changed all tim library calls to use arrays instead of anonymous hashes for a cleaner API. - Changed shebang lines to use /usr/bin/env to improve portability on systems with different Perl versions installed. - Cleaned up and made POD documentation more consistent. - Add warnings about database users and passwords in configuration file. v.1.9.7 (svn 539) - Fixed critical bug where an exon containing all three 5'UTR, CDS, and 3'UTR was not properly parsed in the script get_ensembl_annotation.pl. New command line options to include or exclude CDS, UTR, and start/stop codons were added. Significant changes to improve and organize the code were also made. - Changed the method of assigning the GFF type for chromosomes and scaffolds based on their name in the script ucsc_table2gff3.pl. Also made the inclusion of start and stop codons enabled by default. - Removed annoying automatic column assignment for input GFF files in script data2bed.pl. GFF files are still handled properly if no columns are specified on the command line. v.1.9.6 (svn 533) - Fixed critical bug in script ucsc_table2gff3.pl where single exons containing all three 5'UTR, CDS, and 3'UTR subfeatures were not properly parsed into GFF3. This had resulted in an extended CDS longer than expected. Thanks to H. Stovall for reporting. - Added warnings when a sequence could not be generated to avoid division by 0 errors, and a slight correction to fraction calculations, in script CpG_calculator.pl.
v.1.9.5 (svn 525) - Changed the non-intuitive --except option to a more intuitive --zero option in script manipulate_datasets.pl; this is now a boolean option to include or exclude zero values when calculating statistics. The printed statistics output has also been cleaned up and no longer includes decimal formatting. The export function will automatically generate a name when executed automatically. - Added capability to use a column of source values rather than a static text string for the GFF source tag in script data2gff.pl. Also made improvements to the interactive ask session. - Added the capability to use a big file dataset as the database for chromosome information in script find_enriched_regions.pl. - Added an option to automatically convert the output file to a BED file in script get_gene_regions.pl, and included a description of the --in option in the POD documentation. v.1.9.4 (svn 519) - Fixed first critical bug in script get_datasets.pl where strand information in input files with genomic coordinates (e.g. BED files) was not considered when adjusting coordinates (start, stop, or fractional). - Fixed second critical bug in script get_datasets.pl where collecting fractional data for named database features resulted in data collection over the entire feature. - Improved interpretation of input file features as genomic regions or named features in script get_datasets.pl. - Changed the --set_strand option to --force_strand in multiple data collection scripts. This should make the function a little more obvious as to its purpose. Documentation changed as appropriate. v.1.9.3 (svn 516) - Fixed bug where wig definition lines may not be written when no alignments exist in the first 2 Mb of a chromosome when converting a bam file to a wig file in script bam2wig.pl. Definition lines are now always written. Thanks to Matt J. for reporting. 
- Fixed bug where the format_with_commas sub was not properly imported into the tim_db_helper library - Fixed bug where the bed output from script get_features.pl did not properly report strand information. v.1.9.2 (svn 510) - Fixed critical bug where codon changes were not reported correctly for minus strand genes in script locate_SNPs.pl. Thanks to Craig K. for reporting. v.1.9.1 (svn 507) - Added critical code to interpret strand information from input files such as Bed and GFF into BioPerl standards. Essential for collecting stranded data. Also properly writes back strand information for valid Bed and GFF files - Updated and unified internal library methods for validating and requesting database feature types. By default, all database features are presented to the user as a list when selecting database features to collect data. The source_exclude parameter in the biotoolbox.cfg configuration file is now deprecated. - Upgraded script get_intersecting_features.pl to automatically recognize input file columns and search for more than 1 feature type - Fixed bug in script get_datasets.pl where the program will not continue when only a data database was provided - Fixed bug of requesting index when using a .kgg file as a gene list in script pull_features.pl - Fixed bug in generating file name for Treeview export function in script manipulate_datasets.pl - Fixed behavior when reading files to prevent adding the current program name to the metadata when the input file does not have this metadata - Minor updates to script novo_wrapper.pl v.1.9.0 (svn 493) - Added new script get_features.pl which generates a list of features for one or more feature types from a database. Information about the features may be returned, including name, type, and coordinates. Sub features may be included. The data may be written as a BioToolBox formatted text file, GFF or BED. 
- Added new script correlate_position_data.pl that calculates a Pearson correlation between the score values at identical positions along a feature between two datasets. This helps in identifying changes in spatial distribution of values. An option for calculating shifts is also available. - Improved Big File generation such that Bio::DB::BigWig or Bio::DB::BigBed is no longer required just to generate the big file, as conversion uses external utilities anyway. - Fixed generation of bin values when calculating distribution frequencies in scripts data2frequency.pl and graph_histogram.pl v.1.8.7 (svn 487) - Added new command line options to script merge_datasets.pl to control the program's behavior. The "--lookupname" option allows you to specify the name of the lookup column, while "--manual" turns off all automatic guessing of columns. Also improved handling of original_file metadata. - Added a new option to collect data from long features (such as genomic annotations) instead of point data (microarray or sequence data) in script get_relative_data.pl. - Added option to convert to and from Roman numerals in chromosome names and support for wig files in script change_chr_prefix.pl - Added option to change the IP port number when connecting to a remote MySQL database host in script get_ensembl_annotation.pl - Fixed bug to properly close opened files in script split_data_file.pl and avoid unnecessary error messages. 
- Modified statements and warnings regarding step and span values in script data2wig.pl v.1.8.6 (svn 477) - Added numerous enhancements and bug fixes to script data2wig.pl, including automatically assigning the span parameter in the wig file, identifying coordinate columns, adding command line options for coordinate columns, and updating the POD documentation - Improved the treeview export function in script manipulate_datasets.pl to include different manipulations, including median center of genes or datasets, converting to Z-scores, and converting null values. Also changed the default output name to <basename>.cdt. - Added advanced option to script merge_datasets.pl to specify the column order on the command line instead of interactively. Also increased the number of columns that can be specified as letters. - Added the "value" command line option to specify the type of data to collect to the script find_enriched_regions.pl. Also added the sum method plus some improvements for identifying depleted regions. - Updated the script run_cluster.pl to accept any file name as input, and added basic file format validation checks prior to running the cluster algorithm, among a few other minor improvements - Improved handling of error messages when attempting to open databases that do not exist or can not otherwise be opened. 
- Added more support for reading bedgraph files, dealing with track lines and possibly empty lines - Data from bigWig files that use spanned features (span > 1 bp) are now collected at every base rather than just the start position - Fixed bug where more than two files were not properly merged using lookup in script merge_datasets.pl - Fixed bug to allow data to be collected for Bed files from indexed data files without specifying a database in script get_datasets.pl v.1.8.5 (svn 461) - Fixed critical bug where all knownGene feature strands are reversed in script ucsc_table2gff3.pl - Fixed critical bug where the sign is flipped when generating Z-scores with script manipulate_datasets.pl - Added new functions "convert null values" and "absolute value" to script manipulate_datasets.pl - Added additional file format checks when writing formatted files including GFF, BED, and SGR. File extensions may automatically change to default txt if the format does not match. - Better handling of input Bed files and generating appropriate default file names in script data2gff.pl - Improved merging of datasets by lookup, and loosened restrictions on metadata checking, issuing warnings instead, in script merge_datasets.pl - Loosened restrictions on metadata differences and failures in script join_data_file.pl - Included fix for finding column indices when name is prefixed with # - Added another check to avoid returning undefined values from BigWig data collection v.1.8.4 (svn 448) - Changed shift value determination to use trimmed mean to avoid outliers, and added new option to control the minimum acceptable R^2 value in script bam2wig.pl - Improved script merge_datasets.pl to identify appropriate lookup columns automatically and successfully merge more than two files using lookup - Changed my implementation of Z-score generation so that signed values are properly reported instead of absolute values in script manipulate_datasets.pl - Fixed critical bug where output files
were prematurely closed when splitting a data file in script split_data_file.pl - Reduced some unnecessary error reporting when opening databases that do not exist - Updated list of column names to avoid in script graph_data.pl - Updated interactive prompts in script manipulate_datasets.pl - Fixed bug where the --pos option in script_datasets.pl did not accept the 'm' argument - Fixed bug where strand was reported as '.' instead of '0' in script get_feature_info.pl - Fixed bug regarding writing headers, especially with new BED files - Fixed bug when providing an index of 0 on the command line with script manipulate_datasets.pl v.1.8.3 (svn 431) - Improved mapping efficiency, made tag dataset optional, added direct support of BigWig and BigWigSet datasources, and updated documentation to script map_nucleosomes.pl. - Updated script verify_nucleosome_mapping.pl to accommodate changes in map_nucleosomes.pl output, added support for generic input files, added option for other datasources, and added direct support for BigWig and BigWigSet datasources. - Added multiply and add methods to script manipulate_datasets.pl. - Added firstIntron and lastIntron to list of regions to collect in script get_gene_regions.pl - Fixed critical bug when collecting data about GFF features from a database that caused a crash when no features were found. - Fixed bug in get_gene_regions.pl when collecting introns where the last intron was skipped and reverse strand coordinates were flipped - Fixed bugs in manipulate_datasets.pl where a list of invalid index numbers could still evaluate to index 0, and the start column may not be recognized when performing a genomic sort.
- Fixed bug where text files with DOS/Windows line endings (CRLF) were not loaded properly - Fixed bug in data2wig.pl to skip positions less than or equal to 0 - Improved null value reporting when collecting data v.1.8.2 (svn r411) - Added new script CpG_calculator.pl to count observed and expected CpG dinucleotides across a genome sequence or defined regions. - Added R61 SacCer2 to R64 SacCer3 conversion to script convert_yeast_genome_version.pl. Also improved chromosome name recognition and identification of columns in custom file structures. - Fixed and improved bin generation and output in scripts data2frequency.pl and graph_histogram.pl. Values outside of the requested range are now ignored. Script data2frequency.pl also has considerable code cleanup and reorganization. - Added a sum method and made minor enhancements to wig data collection to script bin_genomic_data.pl, along with considerable code cleanup. - Added automatic capability to script merge_datasets.pl. All unique columns are automatically merged without manual interaction. This is now useful for automated shell scripts. - Enforced no compression when generating bigWig files, and improved column recognition in script data2wig.pl - Changed 'primary_tag' to 'type' in the generated metadata and subtrack selection for BigWigSet database output in script big_file2gff3.pl. Also improved conf stanza renaming scheme for BigWigSets. - Fixed bug in script bar2wig.pl that prevented the USeq App Bar2Gr from being used. v.1.8.1 (svn r392) - Updated script find_enriched_regions.pl to handle separate feature and data databases if desired, and add capability to restrict searches to specific strands. - Updated script map_transcripts to handle chromosome names without integers in their names - Brought script convert_yeast_genome.pl back out of retirement and updated with R63 to R64 convertor - Added chromosome and sequence sorting to GFF3 output from script get_ensembl_annotation.pl.
Also include Ensembl API version reporting. - Updated script check_dependencies.pl to report the installed Ensembl API version number - Improved GFF3 parsing and minor improvements to script gff3_to_ucsc_table.pl - Fixed bugs when working with BigWigSet databases, where a trailing slash in the directory name may lead to different behaviors, and unexpected results when collecting data from BigWigSet databases using two different methods in the same program - Fixed bug where null values in tab-delimited text files are now internally converted to null character . - Fixed sorting issues in script split_bam_by_isize.pl - Fixed bugs in script novo_wrapper.pl that prevented an uncompressed Fastq input file from being split properly, split input files from being removed after aligning, and a single unsorted Bam file is not further processed v.1.8.0 (svn r378) - Moved script novo_wrapper.pl out of retirement (due to popular demand) and significantly updated it to handle parallel execution - Retired old script merge_SNPs and replaced it with new intersect_SNPs.pl script, which is an improved version that uses the VCF format. - Updated script locate_SNPs.pl to work with multiple alternate sequences, multiple features, and importantly with the VCF format - Added .vcf and .bdg extensions as properly recognized file format extensions. Changed default bedgraph extension to use .bdg in script bam2wig.pl - Stripped all code and mention of binary tim_data_formatted files based on Storable. 
Not really a prominent feature and never lived up to its hype anyway, so removing it v.1.7.4 (svn r363) (not released) - Fixed critical bug that prevented local Bam files from opening for data collection - Added warnings if a chromosome segment failed to be found in a database v.1.7.3 (svn r355) - Fixed bugs in script bam2wig.pl that prevented it from finding its libraries and compiling properly; and another bug that prevented stranded start positions from being recorded properly v.1.7.2 (svn r351) - Fixed bug in script ucsc_table2gff3.pl where the output file name may not be properly generated, leading to an overwrite of the input file. - Fixed bug in script bam2wig.pl where the recorded position is off by 1 bp - Added recommended settings in the POD for bam2wig.pl v.1.7.1 (svn r346) - Fixed critical bug in data collection library that allowed too many datapoints to be collected by ignoring the stop position. This could affect scripts get_datasets.pl, get_relative_data.pl, average_gene.pl, find_enriched_regions.pl, and others. - Major overhaul of script pull_features.pl to include better automatic identification of identifier columns, the capability to match multiple features, and to simultaneously write all groups from a .kgg list - Updated script get_datasets.pl so that it would rewrite the output file after each round of data collection. - Minor bug fixes in script find_enriched_regions.pl - Retired outdated script convert_yeast_genome_version.pl. Users should use the liftOver program from UCSC and chain files from SGD. v.1.7.0 (svn r340) - Added new program get_gene_regions.pl which helps in retrieving regions not explicitly annotated in a database, including start and stop sites of transcription and introns. - Added new program data2fasta.pl which generates a multi-Fasta file from a tab-delimited text file of coordinates or a list of sequences, such as microarray probes.
- Added new program compare_subfeature_scores.pl which compares a list of features and subfeatures and finds the subfeature with the minimum and maximum score.
- Major update to the data collection scripts to improve memory consumption and efficiency, and a significant boost in speed when working with BigWig data sources (I have seen up to 10 fold increase, depending on collection methods).
- Improvements when working with BigWigSet directories, including working with impromptu directories of BigWig files that do not have a defined metadata file.
- Added the option of using separate annotation and data databases when using the data collection scripts. This greatly simplifies things when you have, for example, an annotation SeqFeature::Store database and a BigWigSet database of data.
- Added the rpkm method to work with any segment, not just genes with exons, in data collection scripts get_datasets.pl and average_gene.pl
- Fixed bugs in script ucsc_table2gff3.pl, data2wig.pl, find_enriched_regions.pl, and bar2wig.pl
v.1.6.4 (svn r314)
- Major update to script bam2wig.pl to reduce memory consumption by writing incremental portions. The strand option is now a boolean option, and when enabled, automatically writes both strands simultaneously. The binning of read counts into windows of user-selected size is now possible. The optimal shift value for ChIP-Seq data can now be empirically determined from the reads using a statistical method.
- Added additional support for UCSC ensGene tables by including ensemblToGeneName and ensemblSource supplemental tables in script ucsc_table2gff3.pl. The common gene name is now included in the output GFF3 file.
- Added rna_count function to script get_feature_info.pl
- Added minimum and maximum value functions to script manipulate_datasets.pl
- Included a range option when generating a summary file in script manipulate_datasets.pl
- Improved the regular expression matching of the chromosome name when sorting by genomic coordinates in the script manipulate_datasets.pl
- Increased the number of available letters when requesting indices from the second file in script merge_datasets.pl
- Updated script check_dependencies.pl to handle missing dependencies more gracefully
- Updated error handling of missing Perl module dependencies, including IO::Zlib
- Fixed bug where the default chromosome exclusion list in biotoolbox.cfg wasn't being used when generating a new genome interval list
- Fixed bug where a script might ignore the --nogz option when the original file was gzipped
- Fixed bug in script split_data_file.pl where a filename may get out of sync between what was requested and what is written
v.1.6.3 (svn r293)
- Added knownGene as a source in script ucsc_table2gff3.pl
- Improved handling of the chromosome exclusion list in library tim_db_helper
- Fixed bug where an exception could occur if multiple genomic regions on different chromosomes are returned from a database query. Included logic to help identify the appropriate intended chromosome.
- Fixed bug where an exception and crash could occur if the query chromosome is not present in a bigWig, bigBed, or Bam file when collecting data. Chromosome names are now checked prior to query.
- Fixed bug in script get_datasets.pl where a null value is returned instead of 0 when using the method of sum.
- Removed several minor bugs that could generate non-fatal Perl warnings
v.1.6.2 (svn r282)
- Fixed bugs in script data2bed.pl that prevented a bigBed file from being generated. Also improved autodetection of data columns and allowed for dummy data to be inserted in lower column data when writing higher column data. Also added ability to use either the GFF Name or ID attribute as the Bed feature name.
- Added span option to script data2wig.pl when making wig files.
- Renamed script process_agilent.pl to process_microarray.pl. Completely restructured internal data to accommodate multi-slide arrays and other file formats, including NimbleGen and GenePix.
- Removed annoying verbose output from script split_data_file.pl and improved efficiency.
- Stopped writing index keys in the metadata of tim data file formats. Index is now automatically calculated and retained internally. Also avoids writing metadata automatically if it wasn't present in the first place.
- Added summary export function to script manipulate_datasets.pl. This replicates the summary option from script get_relative_data.pl.
- Added multi-column support to the subtract and division functions in script manipulate_datasets.pl.
- Minor bug fixes and improvements to script map_oligo_data2gff.pl.
- Improved script gff3_to_ucsc_table.pl to handle gzip files and make the UCSC bin column optional.
- Added character escaping when generating GFF3 files.
- Improved handling of BigWigSet directories in script big_file2gff3.pl where the set name is used as the final subdirectory in the target path. Also improved name handling.
- Fixed bug in writing Sam files in script change_chr_prefix.pl. Also added increased support for pragmas and fasta sequences in GFF3 files, and support for non-standard text files.
- Changed the score column name to the more meaningful outfile basename when writing summary files.
- Fixed data collection from Bed files in script bin_genomic_data.pl.
- Renamed script map_relative_data.pl to get_relative_data.pl; updated the POD to be more helpful.
v.1.6.1 (svn r258)
- updated the inline documentation for all perl scripts to include the version option
v1.6.0 (svn r253)
- added version numbers and reporting to all perl scripts and modules
- retired a number of outdated scripts
- renamed script map_data.pl to map_relative_data.pl
v1.5.9 (svn r247)
- updated script big_file2gff3.pl to generate BigWigSet conf stanzas with subtracks, also more thorough conf stanzas
- added additional axis formatting options to script graph_profile.pl
- fixed critical error in library tim_db_helper where relative coordinates were not correctly reported in function get_region_dataset_hash()
- improved handling of opening a bigwigset database in library tim_db_helper::bigwig
- major overhaul of script average_gene.pl to work with bed files, add new methods including rpm support, and general much-needed reorganization
- improved error messaging in biotoolbox libraries by using confess instead of croak
- reorganized the order of checking for the biotoolbox configuration in tim_db_helper::config
v1.5.8 (svn r240) (not released)
- fixed some bugs with script graph_histogram.pl concerning the bins and their labels
- updated script gff3_to_ucsc_table.pl to work with gene models without transcripts and fix bugs handling comments and pragmas
- fixed bug with trimming windows in script find_enriched_regions.pl by including absolute option to get_region_dataset_hash() function in library tim_db_helper
- added option to randomly assign strand for paired-end features to script bam2gff_bed.pl
- fixed chromosome regex issue with non-standard chromosome names in script bar2wig.pl
- updated methods to get chromosome sizes in libraries tim_db_helper::bigwig and tim_db_helper::bigbed
- added new parameter chromosome_exclude in configuration file biotoolbox.cfg, which allows specific chromosomes to be excluded when generating new feature or genomic interval lists
- removed all references to key reference_sequence_type from config file biotoolbox.cfg and associated scripts
- updated chromosome reference, and added logic to automatically identify column indices in script data2bed.pl
- updated several scripts to use seq_ids to retrieve chromosome lists
- fixed bug in script get_feature_info.pl where short feature lists would cause a failure when generating a list of possible attributes from sample features
v1.5.7 (svn r227) (not released)
- major overhaul of script get_datasets.pl
- removed subs get_feature_dataset() and get_genome_dataset() from library tim_db_helper, functionality moved to script get_datasets.pl
- added data color options to script graph_profile.pl
- completely updated script map_data.pl to work with chromosome segments rather than named features, and added rpm support
- added new sub to check datasets for rpm support in library tim_db_helper
- fixed bug when specifying no datasets in script get_datasets.pl
- improved support for BigWigSet databases in library tim_db_helper and script print_feature_types.pl
v1.5.6 (svn r223) (not released)
- added rpm method to score functions in library tim_db_helper
- minor bug fixes and adjustments to help rpm method in tim_db_helper bigwig, bigbed, and bam libraries
- minor bug fix in script find_enriched_regions.pl
- fixed export bug in library tim_db_helper::bigbed
- fixed bug in library tim_db_helper sub process_and_verify_dataset() where new datasets would never be prompted
- corrected the method for counting bed features in library tim_db_helper::bigbed
- fixed alignment collection to only take alignments with midpoint positions within the requested region in library tim_db_helper::bam
v1.5.5 (svn r219) (not released)
- added new avoid option to method get_region_dataset_hash() in library tim_db_helper
- updated script map_data.pl to use get_region_dataset_hash()
- fixed bug in method validate_dataset_list() in library tim_db_helper
- fixed bug in script merge_datasets.pl where table headers may not be written properly
- fixed bug in tim_db_helper::get_genome_dataset() when more than one segment was found
- made numerous improvements in opening db connections in library tim_db_helper
- made changes to assigning feature type when opening certain files in library tim_file_helper
- fixed bug in library tim_db_helper where bed file coordinates were not written out in interbase
- moved the sum_total_alignments() subroutine from the script bam2wig.pl to the library tim_db_helper::bam
- added support for stranded paired-end RNA-Seq bam files aligned with TopHat which use the XS attribute to record strand information in scripts bam2wig.pl and bam2gff_bed.pl
- disabled splices on paired-end bam files in script bam2wig.pl
v1.5.4 (svn r209) (not released)
- added more explicit support for bed files in the tim_file_helper and tim_data_helper libraries, including data structure verification, interbase to base conversion, and metadata handling
- generalized bam and bigfile database handling to tim_db_helper libraries
- simplified generating genomic windows in tim_db_helper
- improved handling of collecting data from bigfile databases in tim_db_helper libraries
- added chromosome feature output to script big_file2gff3.pl
- updated numerous scripts to reflect tim_db_helper changes; general code cleanup
- further simplification and code cleanup of library tim_db_helper, including database and dataset list verification, and removing redundant code in collecting dataset values
- added new subroutine process_and_verify_dataset() to library tim_db_helper
- updated scripts average_gene.pl, find_enriched_regions.pl, and map_data.pl to use the new sub process_and_verify_dataset()
v1.5.3 (svn r205)
- Fixed bug in script bam2wig.pl that prevented spliced alignments from being properly checked and recorded.
- Fixed numerous bugs in script ucsc_table2gff3.pl, including a bug where the gene start coordinate may not be updated from interbase to base, and not accurately converting the CDS phase
- Added new features to the script ucsc_table2gff3.pl, including automatic table retrieval through FTP from UCSC to greatly simplify conversion, adding support for knownGene and xenoRefGene tables, customizing the type of features to output, properly handling features with duplicate names by creating unique IDs, and optionally including chromosome information in the output GFF3 file
- Deleted the now redundant script ucsc_chrom2gff3.pl
v1.5.2 (svn r200)
- Updated several scripts and libraries to fix bugs in handling GFF version numbers and pragmas.
- Added unique IDs to the gff3 output from bam2gff_bed.pl
- Added option to deal with multiple values at identical positions in the script data2wig.pl
- Added support for log2 values when combining multiple values at identical positions in scripts data2wig.pl, bar2wig.pl, and useq2bigfile.pl.
- Retired the outdated script just_blast_oligos.pl.
v1.5.1 (svn r193)
- Fixed critical bug in script bar2wig.pl where values from multiple positions were not combined properly. Also fixed bug with processing a single bar file.
- Removed required dependencies of bioperl for scripts bar2wig.pl and useq2bigfile.pl
- Fixed small bug in tim_db_helper::bigbed library to ensure positions were within the region of interest
- Added mapping quality filter and other improvements to script bam2wig.pl
- Changed score reporting to record mapping quality in script bam2gff_bed.pl
v1.5 (svn r184)
- Added script useq2bigfile.pl for converting USeq archives
- Added script check_dependencies.pl for assisting in checking for Perl module dependencies. It will help install the latest versions through CPAN
- Changed the biotoolbox configuration file from lib/tim_db_helper.cfg to biotoolbox.cfg in the root directory.
- Moved the biotoolbox configuration loader into a separate module as lib/tim_db_helper/config.pm. This avoids requiring installing BioPerl and loading all of tim_db_helper.pm when it may not be necessary.
- Updated numerous scripts to reflect changes with the biotoolbox configuration loader.
- added axes labeling options to scripts graph_data.pl and graph_histogram.pl
- fixed bug in handling bed files in library tim_file_helper
- minor fixes in script data2wig.pl
- improved working with bigfile conversions
- fixed minor bug in script big_file2gff3.pl when leaving files in the current directory
v1.4.4 (svn r162)
- Added reads per million option to script bam2wig.pl
- Added parent, exon, and transcript_length attributes to script get_feature_info.pl
- Updated scripts find_enriched_regions.pl and map_transcripts.pl to work with standalone data files (BigWig, BigBed, Bam)
- Added configuration, description, and capabilities for working with SQLite database files in tim_db_helper
- Added midpoint as acceptable coordinate in script data2wig.pl
- Bug fixes to script locate_SNPs.pl and bam2wig.pl; library tim_db_helper::bam
v1.4.3 (svn r144)
- Changed script bar2wig.pl to require method for combining values and removed interbase option
- Updated peak identification in script map_nucleosomes.pl to use the tag dataset and not the scan dataset
- Updated script big_file2gff3.pl to produce more useful conf files with BigWigSets
- Added overlap data column to output of script get_intersecting_features.pl and added --set_strand option to enforce directionality
- Added three new functions to script manipulate_datasets.pl, including new column, strandsign, and mergestrand
- Fixed script wig2data.pl so it works now
- Updated script get_feature_info.pl to parse an attribute list from the command line
- Improved handling of metadata when opening tim data files
v1.4.2 (svn r129)
- Added fast low level coverage function to the script bam2wig.pl
- Fixed script pull_features.pl to keep the order of features in the list file.
- Fixed script bar2wig.pl to correctly identify the chromosome name.
- Various bug fixes to the database library helper tim_db_helper.pm.
v1.4.1 (svn r119)
- Fixed bug with get_ensembl_annotation.pl where a protein_coding gene encoding a transcript lacking a CDS will write inappropriate coordinates. These transcripts will not write start_codon, stop_codon, or CDS subfeatures.
- Fixed bug with script get_intersecting_features.pl where regions specified with a start or stop modifier were not being selected properly.
- Fixed bug with tim_db_helper modules that prevented working with source data files specified in a database feature
- Added log transformation of count in script bam2wig.pl
v1.4 (svn r111)
- Added script bam2wig.pl for enumerating alignments and writing a wig file of the counts.
- Added script change_chr_prefix.pl for adding or stripping chromosome prefixes from data and annotation files.
- Bug fixes to ucsc_table2gff3.pl.
v1.3 (svn r104)
- Added ability to restrict data collection to exon subfeatures to script get_datasets.pl. Useful for RNA-seq analysis.
- Added exon count as attribute to script get_feature_info.pl.
- Bug fixes to get_datasets.pl.
v1.2 (svn r98)
- Added support for bam files as a data source.
- Updated data collection scripts to allow direct referencing of data source files, including bigWig, bigBed, and Bam files, on the command line, without having to reference the files from within the database.
v1.1 (svn r92)
- Updated script ucsc_table2gff3.pl to use Bio::SeqFeature::Lite. Now outputs exon and codon features.
- Updated script get_ensembl_annotation.pl to collect RNA features from Ensembl as well as generate exon and codon features.
- Added script gff3_to_ucsc_table.pl to generate UCSC style refSeq tables from GFF3 formatted data.
v1.0.2 (svn r91)
- Bug fixes to libs tim_file_helper and tim_db_helper
- Bug fixes to scripts print_feature_types.pl, get_intersecting_features.pl, big_file2gff3.pl, graph_data.pl, graph_histogram.pl, graph_profile.pl
v1.0 (svn r68)
- Initial public release of an archive. Previous versions were only available through SVN.