NAME

filter_bam.pl

A script to filter a Bam file for specific criteria.

SYNOPSIS

filter_bam.pl <file.bam>

  Options:
  --in <file.bam>
  --out <filename>
  --(no)pass
  --(no)fail
  --(no)align
  --mismatch
  --gap
  --indel
  --mproper
  --mseqid
  --mstrand
  --score <integer>
  --length <integer>
  --seq <pos:[ATCG]>
  --attrib <key:value>
  --index
  --version
  --help

OPTIONS

The command line flags and descriptions:

--in <file.bam>

Specify the file name of a binary Bam file as described for Samtools. It does not need to be sorted or indexed.

--out <filename>

Optionally specify the base name of the output file. The default is to use the input base name with '.filter' appended. If both pass and fail files are written, then '.pass' and '.fail' are appended to the respective file names.

--pass
--nopass

Indicate whether (or not) alignments which pass the test criteria should be written to an output Bam file. The default is true.

--fail
--nofail

Indicate whether (or not) alignments which fail the test criteria should be written to an output Bam file. The default is false.

--align
--noalign

Indicate whether (or not) aligned reads should pass. The default is true.

--mismatch

Indicate that only alignments with a mismatch should pass. A mismatch is indicated by either the NM or MD attributes of the alignment, or by the presence of X (mismatch) operations in the CIGAR string. Gaps, clipped sequences, and padded sequences are not counted.
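
As an illustration of the idea only, and not the script's own Perl code, the following Python sketch detects a mismatch from the CIGAR string and the MD attribute. The function name has_mismatch is hypothetical. The NM attribute is left out of the sketch because, under the SAM specification, it is an edit distance that also counts inserted and deleted bases.

```python
import re
from typing import Optional


def has_mismatch(cigar: str, md: Optional[str] = None) -> bool:
    """Return True if the CIGAR string or the MD attribute reports a mismatch."""
    # An X operation in the CIGAR string marks mismatched bases.
    if re.search(r"\d+X", cigar):
        return True
    if md is not None:
        # Deletions appear in MD as '^' followed by reference bases;
        # they are removed first because they are not mismatches.
        without_deletions = re.sub(r"\^[A-Za-z]+", "", md)
        # Any letter that remains is a reference base that differs from the read.
        return re.search(r"[A-Za-z]", without_deletions) is not None
    return False
```

For example, an MD value of 10A25 reports one mismatch, while 10^AC26 reports only a two-base deletion.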

--gap

Indicate that only alignments with a gap should pass. Gaps are determined by the presence of N (skipped) operations in the CIGAR string.

--indel

Indicate that only alignments with either an insertion or deletion should pass. Indels are determined by the presence of I (insertion) or D (deletion) operations in the CIGAR string.
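
The gap and indel tests both come down to checking which operations occur in the CIGAR string. A minimal Python sketch of that check follows; the helper names (cigar_ops, has_gap, has_indel) are hypothetical and are not taken from the script.

```python
import re
from typing import Set


def cigar_ops(cigar: str) -> Set[str]:
    """Return the set of operation letters present in a CIGAR string."""
    return {op for _, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)}


def has_gap(cigar: str) -> bool:
    # N marks a skipped region of the reference, e.g. a spliced intron.
    return "N" in cigar_ops(cigar)


def has_indel(cigar: str) -> bool:
    # I is an insertion to the reference, D is a deletion from the reference.
    return bool(cigar_ops(cigar) & {"I", "D"})
```

A spliced alignment such as 20M500N16M has a gap but no indel, whereas 10M1I25M has an indel but no gap.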

--mproper

Indicate that only alignments that are part of a proper pair should pass. Proper pairs are Forward-Reverse alignments on the same reference, and do not include Forward-Forward, Reverse-Reverse, Reverse-Forward, or separate reference sequence alignments.

--mseqid

Indicate that only paired alignments that are on the same reference sequence should pass.

--mstrand

Indicate that only paired alignments that align to different strands should pass, i.e. a Forward-Reverse or Reverse-Forward orientation.
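
The three mate tests can be pictured in terms of standard SAM fields. The Python sketch below uses the FLAG bits defined in the SAM specification (0x1 paired, 0x2 proper pair, 0x10 read reverse strand, 0x20 mate reverse strand) and the RNAME and RNEXT columns. Whether the script reads these exact fields or derives the answers another way is an assumption, and the function names are hypothetical.

```python
PAIRED = 0x1         # read is paired in sequencing
PROPER_PAIR = 0x2    # aligner marked the pair as properly aligned
REVERSE = 0x10       # read aligns to the reverse strand
MATE_REVERSE = 0x20  # mate aligns to the reverse strand


def is_proper_pair(flag: int) -> bool:
    """Both the paired bit and the proper-pair bit must be set."""
    return bool(flag & PAIRED) and bool(flag & PROPER_PAIR)


def same_reference(rname: str, rnext: str) -> bool:
    """RNEXT of '=' means the mate is on the same reference as the read."""
    return rname != "*" and rnext in ("=", rname)


def opposite_strands(flag: int) -> bool:
    """True when exactly one of read and mate is on the reverse strand."""
    return bool(flag & REVERSE) != bool(flag & MATE_REVERSE)
```

A FLAG of 99, for instance, describes a forward read whose mate is reversed and which the aligner marked as a proper pair.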

--score <integer>

Indicate that only alignments which have a mapping quality score equal to or greater than the indicated value shall pass. The mapping quality score is derived from the posterior probability (P) that the alignment was mapped incorrectly, and is reported as a -10*log10(P) value, rounded to the nearest integer (range 0..255). Higher numbers are more stringent.
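
To help choose a threshold, the relationship between the score and the error probability can be worked through in a few lines of Python. This is a sketch of the formula stated above; the function names are hypothetical.

```python
import math


def mapq_to_error_probability(mapq: int) -> float:
    """Invert Q = -10*log10(P): the probability that the mapping is wrong."""
    return 10 ** (-mapq / 10)


def error_probability_to_mapq(p: float) -> int:
    """Apply Q = -10*log10(P), rounded to the nearest integer."""
    return round(-10 * math.log10(p))
```

Under this formula a score of 20 corresponds to a 1 in 100 chance of an incorrect mapping, and a score of 30 to 1 in 1000.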

--length <integer>

Indicate that only alignments whose query sequence equals the indicated length shall pass. Provide a comma-delimited list and/or range of lengths. Note that only the query sequence is checked, not the length of the alignment. Multiple lengths are treated as a logical OR operation.
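
A sketch of how such a length specification might be interpreted is shown below in Python. The range syntax is an assumption: a hyphen between two integers (for example 40-42) is used here, which may differ from what the script accepts. The function names are hypothetical.

```python
def parse_lengths(spec: str) -> set:
    """Expand a comma-delimited list and/or ranges into a set of lengths."""
    lengths = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # Assumed range syntax: start-end, inclusive at both ends.
            start, end = (int(x) for x in part.split("-", 1))
            lengths.update(range(start, end + 1))
        else:
            lengths.add(int(part))
    return lengths


def length_passes(query_sequence: str, lengths: set) -> bool:
    # Logical OR: the query length may match any one of the listed lengths.
    return len(query_sequence) in lengths
```

With a specification of 36,40-42, query sequences of length 36, 40, 41, or 42 would pass.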

--seq <pos:[ATCG]>

Indicate that only alignments that have a specific nucleotide at a specific position in the query sequence shall pass. Provide a position:nucleotide pair, where position is a 1-based integer and the nucleotide is one or more of A,C,G, or T. Providing two or more nucleotides per position is treated as a logical OR operation. Multiple sequence positions may be tested by issuing multiple command line options, in which case they are combined in a logical AND operation.
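
The OR-within, AND-across logic described above can be sketched in Python as follows. This is an illustration under the stated rules, not the script's implementation, and the function names are hypothetical.

```python
def parse_seq_test(spec: str):
    """Split a 'position:nucleotides' pair such as '1:AT'."""
    position, nucleotides = spec.split(":", 1)
    return int(position), set(nucleotides.upper())


def seq_passes(query_sequence: str, tests) -> bool:
    """Every position must hold one of its allowed nucleotides."""
    return all(
        # Positions are 1-based; a position beyond the read fails the test.
        1 <= pos <= len(query_sequence)
        and query_sequence[pos - 1].upper() in nucleotides
        for pos, nucleotides in tests
    )
```

Giving both 1:AT and 2:G would therefore pass a read beginning with TG or AG, but not one beginning with CG or AA.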

--attrib <key>
--attrib <key:value>

Indicate that only alignments that contain a specific optional attribute shall pass. One or more values may also be provided for the key, in which case only those alignments which match one of the key values shall pass. The values may be provided as a comma delimited list separated from the key by a colon. Attribute keys are typically two letter codes; see the SAM specification at http://samtools.sourceforge.net/SAM1.pdf for a list of standard attributes. Two or more key values are combined in a logical OR operation. Two or more attribute keys may be tested by specifying multiple --attrib command line options; in this case, they are combined in a logical AND operation.
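
As a sketch of the matching rules just described, the Python below treats an alignment's optional attributes as a dictionary of tag to value. That representation, the string comparison of values, and the function names are assumptions made for illustration.

```python
def parse_attrib_test(spec: str):
    """Split 'key' or 'key:value1,value2' into a key and its allowed values."""
    key, separator, values = spec.partition(":")
    # With no colon, only the presence of the key is tested.
    return key, (set(values.split(",")) if separator else None)


def attrib_passes(tags: dict, tests) -> bool:
    """All keys must be present (AND); any listed value may match (OR)."""
    for key, allowed in tests:
        if key not in tags:
            return False
        if allowed is not None and str(tags[key]) not in allowed:
            return False
    return True
```

For example, combining NH:1,2 with a bare XS key would pass an alignment carrying NH of 1 or 2 together with any XS value.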

--index

Optionally re-index the output Bam file(s) when finished. If necessary, the Bam file is sorted by coordinate first. The default is false.

--version

Print the version number.

--help

Display the POD documentation.

DESCRIPTION

This program will filter the alignments in a Bam file according to a series of one or more boolean tests. Alignments which pass all the tests are written to an output Bam file. Alignments which do not pass one or more filters may be optionally written to a second Bam file.

There are a number of tests that may be applied to the alignments, controlled by command line arguments. Please note carefully how the test is performed and whether your desired outcome should be the pass or fail outcome. When multiple tests are indicated, they are combined using a logical AND operation.
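
The overall flow can be pictured with a short Python sketch: each test is a function that returns true or false for an alignment, and an alignment passes only if every test does. Alignments are represented here as plain dictionaries, and the name partition is hypothetical; the script itself works on Bam records.

```python
def partition(alignments, tests):
    """Split alignments into those passing all tests and those failing any."""
    passed, failed = [], []
    for alignment in alignments:
        # Logical AND: a single failing test sends the alignment to 'failed'.
        if all(test(alignment) for test in tests):
            passed.append(alignment)
        else:
            failed.append(alignment)
    return passed, failed
```

In this picture, --pass and --fail simply decide which of the two resulting lists is written out, which is why the examples that look for unusual alignments write the failing set.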

The input and output files are BAM files as described by the Samtools project (http://samtools.sourceforge.net).

EXAMPLES

Here are a few examples of how to use filters.

Alignments that may indicate a SNP

A SNP could appear as a mismatch, an insertion, or a deletion.

 filter_bam.pl --mismatch --indel --in file.bam

RNASeq alignments that could span an intron

 filter_bam.pl --gap --in file.bam

MNase digested DNA

Chromatin may be digested using MNase, which cuts blunt ends between [AT][AT] dinucleotides. To increase the likelihood that sequences were derived from MNase digestion, filter for an [AT] nucleotide at the first position.

 filter_bam.pl --seq 1:AT --in file.bam

Alignments indicating chromosomal rearrangement

In paired-end sequencing of genomic DNA, pairs whose two ends map to separate chromosomes, or that do not map in a proper forward-reverse arrangement, may suggest a chromosomal rearrangement. In this case, we want those alignments that fail the test.

 filter_bam.pl --nopass --fail --mproper --out non_properly_paired --in file.bam
 
 filter_bam.pl --nopass --fail --mseqid --out translocations --in file.bam

AUTHOR

 Timothy J. Parnell, PhD
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.