NAME

Bio::ToolBox::Parser::bed - Parser for BED-style formats

SYNOPSIS

  use Bio::ToolBox::Parser;
  my $filename = 'file.bed';
  
  my $Parser = Bio::ToolBox::Parser->new(
        file    => $filename,
  ) or die "unable to open bed file!\n";
  # the Parser will taste the file and open the appropriate 
  # subclass parser, bed in this case
  
  while (my $feature = $Parser->next_top_feature() ) {
        # each $feature is a parent SeqFeature object
        printf "%s:%d-%d\n", $feature->seq_id, $feature->start, $feature->end;
  }

DESCRIPTION

This is the BED-specific parser subclass of Bio::ToolBox::Parser, and as such it inherits generic methods from the parent. The following file formats are supported.

Bed

Bed files may have 3-12 columns, where the first 3-6 columns are basic information about the feature itself, and columns 7-12 are usually for defining subfeatures of a transcript model, including exons, UTRs (thin portions), and CDS (thick portions) subfeatures. This parser will parse these extra fields as appropriate into subfeature SeqFeature objects. Bed files are recognized with the file extension .bed.

Bedgraph

BedGraph files are a type of wiggle format in Bed format, where the 4th column is a score instead of a name. BedGraph files are recognized by the file extension .bedgraph or .bdg.

narrowPeak

narrowPeak files are a specialized ENCODE variant of bed files with 10 columns (typically denoted as bed6+4), where the extra 4 fields represent score attributes of a narrow ChIP-seq peak. These files are parsed as a typical bed6 file, and the extra four fields are assigned to SeqFeature attribute tags signalValue, pValue, qValue, and peak, respectively. NarrowPeak files are recognized by the file extension .narrowPeak.

broadPeak

broadPeak files, like narrowPeak, are an ENCODE variant with 9 columns (bed6+3) representing a broad or extended interval of ChIP enrichment without a single "peak". The extra three fields are assigned to SeqFeature attribute tags signalValue, pValue, and qValue, respectively. BroadPeak files are recognized by the file extension .broadPeak.

Track and Browser lines are generally ignored, although a track definition line containing a type key will be interpreted if it matches one of the above file types.
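
As an illustration of the narrowPeak column layout described above, here is a minimal pure-Perl sketch that splits one line into its bed6 fields and the four attribute tags. The helper name parse_narrowpeak_line and the hash layout are hypothetical and are not part of Bio::ToolBox; the real parser builds SeqFeature objects rather than plain hashes.

```perl
use strict;
use warnings;

# Attribute tag names for narrowPeak columns 7-10, in file order.
my @NARROWPEAK_TAGS = qw(signalValue pValue qValue peak);

# Hypothetical helper (not part of Bio::ToolBox): split one narrowPeak
# line into bed6 fields plus a hash of attribute tags.
sub parse_narrowpeak_line {
    my ($line) = @_;
    chomp $line;
    my @f = split /\t/, $line;
    die "expected 10 columns, found " . scalar(@f) . "\n" unless @f == 10;

    my %feature = (
        seq_id => $f[0],
        start  => $f[1] + 1,    # convert 0-based bed start to 1-based
        end    => $f[2],
        name   => $f[3],
        score  => $f[4],
        strand => $f[5],
    );

    # columns 7-10 become attribute tags, in order
    @{ $feature{attributes} }{@NARROWPEAK_TAGS} = @f[ 6 .. 9 ];
    return \%feature;
}

# Example line with made-up values
my $peak = parse_narrowpeak_line(
    join "\t", qw(chr1 999 1500 peak1 850 . 12.5 8.2 6.1 240)
);
printf "%s:%d-%d signalValue=%s peak=%s\n",
    @{$peak}{qw(seq_id start end)},
    @{ $peak->{attributes} }{qw(signalValue peak)};
```

A broadPeak line could be handled the same way by requiring 9 columns and dropping the peak tag.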

SeqFeature default values

The SeqFeature objects built from the bed file intervals will have some inferred defaults.

Coordinate system

SeqFeature objects use the 1-based coordinate system, per the specification of Bio::SeqFeatureI, so the 0-based start coordinates of bed files will always be parsed into 1-based coordinates.

display_name

SeqFeature objects will use the name field (4th column in bed files), if present, as the display_name. The SeqFeature object should default to the primary_id if a name was not provided.

primary_id

The primary_id is a concatenation of the sequence ID, the original 0-based start coordinate, and the stop coordinate, for example 'chr1:0-100'.

primary_tag

Bed files have no inherent feature type (all features in a file are the same type), so a default primary_tag is assigned based on the file type. For peak files (narrowPeak and broadPeak) this is peak. For gappedPeak files this is gappedPeak for the parent feature and peak for its subfeatures. For bed12 files with transcript models, each transcript is set to either mRNA or ncRNA, depending on whether interpreted CDS start and stop positions (thick coordinates) are present.

source_tag

Bed files don't have a concept of a source; the default is an empty string ("").

attribute tags

Extra columns in the narrowPeak and broadPeak formats are assigned to attribute tags as described above. The rgb values set in bed12 files are also set to an attribute tag.
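
The coordinate, primary_id, and display_name defaults described above can be sketched in plain Perl as follows. This is only an illustration under stated assumptions: the helper name bed_defaults and the returned hash are hypothetical, and the actual parser sets these values on SeqFeature objects.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::ToolBox) illustrating the
# documented defaults for a single bed interval.
sub bed_defaults {
    my ( $chrom, $start0, $end, $name ) = @_;

    # primary_id keeps the original 0-based start, e.g. 'chr1:0-100'
    my $primary_id = sprintf "%s:%d-%d", $chrom, $start0, $end;

    return {
        seq_id     => $chrom,
        start      => $start0 + 1,    # 0-based bed start becomes 1-based
        end        => $end,           # bed end already equals the 1-based end
        primary_id => $primary_id,

        # fall back to the primary_id when no name column was provided
        display_name => ( defined $name && length $name )
        ? $name
        : $primary_id,
    };
}

my $unnamed = bed_defaults( 'chr1', 0, 100 );
printf "%s %d-%d id=%s name=%s\n",
    @{$unnamed}{qw(seq_id start end primary_id display_name)};
```

Note that only the start coordinate shifts: a bed interval of 0-100 covers the same bases as the 1-based interval 1-100.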

METHODS

Initializing the parser object

In most cases, users should initialize an object using the generic Bio::ToolBox::Parser object.

These are methods to initialize the parser with an annotation file and modify the parsing behavior. Most parameters can be set either upon initialization or afterwards by calling the corresponding method on the object. Unpredictable behavior may occur if you change these in the midst of parsing a file.

Do not open subsequent files with the same object. Always create a new object to parse a new file.

new
  my $parser = Bio::ToolBox::Parser::bed->new($filename);
  my $parser = Bio::ToolBox::Parser::bed->new(
      file    => 'file.bed',
      do_exon => 1,
      do_cds  => 1,
  );

Initiate a new Bed file parser object. Pass a single value (the bed file name) to open the file for parsing. Alternatively, pass a list of key => value pairs to control how the file is parsed. These options are primarily for parsing bed12 files with subfeatures. Options include the following.

file

Provide the path and file name for a Bed file. The file may be gzip compressed.

source

Pass a string to be added as the source tag value of the SeqFeature objects.

do_exon
do_cds
do_utr
do_codon

For Bed12 formats that represent transcripts, pass a boolean (1 or 0) value to parse certain subfeatures, including exon, CDS, five_prime_UTR, three_prime_UTR, stop_codon, and start_codon features. Default is false.

class

Pass the name of a Bio::SeqFeatureI compliant class that will be used to create the SeqFeature objects. The default is to use Bio::ToolBox::SeqFeature, which is lighter-weight and consumes less memory. A suitable BioPerl alternative is Bio::SeqFeature::Lite.
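
To give a feel for what the subfeature options work with, the following pure-Perl sketch decodes the block columns of one bed12 line into 1-based exon coordinates and picks a transcript type. It is not the module's implementation: the helper name decode_bed12 is hypothetical, and the rule used here (a non-empty thick interval means mRNA, otherwise ncRNA) is an assumption based on the common bed12 convention.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::ToolBox): decode one bed12 line
# into 1-based exon coordinates and a guessed transcript primary_tag.
sub decode_bed12 {
    my ($line) = @_;
    chomp $line;
    my @f = split /\t/, $line;
    die "expected 12 columns\n" unless @f == 12;

    my ( $chrom, $start0, $end ) = @f[ 0 .. 2 ];
    my ( $thick_start0, $thick_end ) = @f[ 6, 7 ];
    my @sizes  = split /,/, $f[10];    # split drops the trailing empty field
    my @starts = split /,/, $f[11];    # offsets relative to the bed start
    die "block count mismatch\n"
        unless @sizes == $f[9] && @starts == $f[9];

    my @exons;
    for my $i ( 0 .. $#sizes ) {
        my $exon_start0 = $start0 + $starts[$i];
        push @exons, [ $exon_start0 + 1, $exon_start0 + $sizes[$i] ];
    }

    # Assumption: a non-empty thick interval marks a coding transcript.
    my $tag = $thick_end > $thick_start0 ? 'mRNA' : 'ncRNA';

    return { seq_id => $chrom, primary_tag => $tag, exons => \@exons };
}

# Example transcript with three blocks (made-up values)
my $tx = decode_bed12(
    join "\t", 'chr1', 1000, 5000, 'tx1', 0, '+', 1200, 4800, 0, 3,
    '500,300,1000,', '0,1500,3000,'
);
print "$tx->{primary_tag} on $tx->{seq_id}\n";
print "exon $_->[0]-$_->[1]\n" for @{ $tx->{exons} };
```

CDS and UTR subfeatures would then follow from intersecting each exon with the thick interval, which this sketch leaves out for brevity.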

Other methods

Additional methods for working with the parser object and the parsed SeqFeature objects.

typelist

Returns a comma-delimited string of the SeqFeature types expected in the file. Currently this returns generic strings: 'mRNA,ncRNA,exon,CDS' for bed12 and 'feature' for everything else.

SEE ALSO

Bio::ToolBox::Parser, Bio::ToolBox::SeqFeature

AUTHOR

 Timothy J. Parnell, PhD
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.