NAME

MS::Search::DB - A class to facilitate construction of MS/MS protein search databases

SYNOPSIS

    use MS::Search::DB;

    my $db = MS::Search::DB->new;

    # add sequences from various sources

    $db->add_from_file("/path/to/proteins.faa");
    $db->add_from_url("http://foo.org/proteomes/XYZ.faa");
    $db->add_from_source(
        source   => 'uniprot',
        taxid    => '12345',
        ref_only => 1,
    );

    # add contaminant sequences from cRAP
    $db->add_crap();

    # remove exact duplicates
    $db->deduplicate();

    # generate decoy sequences
    $db->add_decoys(
        type => 'reverse',
        prefix => 'DECOY_',
    );

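    # write to fh (default STDOUT); the output path is a placeholder
    open my $fh, '>', '/path/to/search_db.faa' or die "Cannot open output: $!";
    $db->write(
        fh        => $fh,
        randomize => 1,
    );
    close $fh;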

DESCRIPTION

MS::Search::DB is intended to facilitate easy construction of MS/MS protein search databases from various sources. It includes methods for fetching protein sequence data, adding common contaminant sequences, adding decoy sequences, and saving the database to disk.

METHODS

new

    my $db = MS::Search::DB->new();

    # or, initialize directly from a file
    my $db = MS::Search::DB->new('/path/to/proteins.faa');

Create a new MS::Search::DB object. A single optional argument pointing to a FASTA file is accepted, which will be loaded into the initial database.

add_from_file

    $db->add_from_file('/path/to/proteins.faa');

Takes one required argument (path to a protein FASTA file) and loads it into the database. Optionally takes a suffix that is added to each sequence ID. Returns the number of sequences added.

add_from_url

    $db->add_from_url(
        'http://somedb.org/proteomes/XYZ.faa',
        id_suffix => '_XYZ',
    );

Takes one required argument (URL referencing a FASTA file) and loads it into the database. Optionally takes a suffix that is added to each sequence ID. Returns the number of sequences added.
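To illustrate what an ID suffix does, here is a minimal sketch of a FASTA reader that appends a suffix to each sequence ID. It is not the module's implementation: the read_fasta helper and the hash-based record layout are hypothetical, and it assumes the suffix is appended to the first whitespace-delimited token of the header line.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of MS::Search::DB.
# Assumption: the ID is the first whitespace-delimited token after '>'.
sub read_fasta {
    my ($fh, %args) = @_;
    my $suffix = $args{id_suffix} // '';
    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';
        if ($line =~ /^>(\S+)/) {
            # header line: start a new record with the suffixed ID
            push @records, { id => $1 . $suffix, seq => '' };
        }
        elsif (@records) {
            # sequence line: append to the current record
            $records[-1]{seq} .= $line;
        }
    }
    return @records;
}

# read from an in-memory string so the example needs no file or network
my $fasta = ">P1 example protein\nMKTA\nYIAK\n>P2\nMSLLTEVE\n";
open my $in, '<', \$fasta or die "Cannot open in-memory handle: $!";
my @records = read_fasta($in, id_suffix => '_XYZ');
close $in;

print "$_->{id}\t$_->{seq}\n" for @records;
```

A suffix like this can help keep IDs distinct when the same accession appears in more than one source.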

add_from_source

    $db->add_from_source(
        source    => 'uniprot',
        id_suffix => '_XYZ',
        # plugin-specific arguments
    );

Fetches data using an MS::Search::DB::Source plugin, selected via the 'source' argument. These plugins facilitate searching common sources of protein sequence data, such as NCBI or UniProt. See the documentation for each plugin (under the MS::Search::DB::Source:: namespace) for the arguments it accepts. Optionally takes a suffix that is added to each sequence ID. Returns the number of sequences added.

add_crap

    $db->add_crap();
    $db->add_crap($url);

Downloads common contaminant sequences and adds them to the database. By default, downloads the "common Repository of Adventitious Proteins", aka "cRAP", from GPM. An optional URL can be provided to fetch from another source.

deduplicate

    $db->deduplicate();

Removes exact duplicate entries (by sequence, not ID), which can sometimes cause issues with downstream software. Sequences are processed in the order in which they were added, so the first occurrence of each duplicated sequence is retained and all subsequent occurrences are discarded.
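The first-occurrence-wins behavior described above can be sketched as follows. This is an illustration only, not the module's code: the dedup_by_sequence helper and the array-of-hashes record layout are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical in-memory records; the module's internal storage may differ.
my @records = (
    { id => 'P1', seq => 'MKTAYIAK' },
    { id => 'P2', seq => 'MSLLTEVE' },
    { id => 'P3', seq => 'MKTAYIAK' },    # same sequence as P1
);

# Hypothetical helper, not part of MS::Search::DB.
sub dedup_by_sequence {
    my @in = @_;
    my %seen;
    # keep a record only the first time its sequence is encountered,
    # preserving the original insertion order
    return grep { !$seen{ $_->{seq} }++ } @in;
}

my @unique = dedup_by_sequence(@records);
print join(',', map { $_->{id} } @unique), "\n";    # prints "P1,P2"
```

Because comparison is by sequence rather than by ID, P3 is dropped even though its ID is unique.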

add_decoys

    $db->add_decoys(
        type => 'reverse',
        prefix => 'DECOY_',
    );

Generates a set of decoy sequences according to the arguments provided and adds them to the database. One decoy will be added for each protein in the original database. Possible arguments include:

  • type — how to generate the decoy sequences. Either 'reverse' or 'shuffle'. (default: reverse)

  • prefix — the prefix to be added to each decoy ID. (default: "DECOY_")

Note that the order in which this method is called matters. Only sequences that have already been added to the database before it is called will be used for decoy generation.
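As a rough illustration of the two decoy types, the sketch below builds a single decoy from one ID and sequence. The make_decoy helper is hypothetical and is not the module's implementation; in particular, a plain random permutation of residues is assumed for 'shuffle', which may differ from what the module does.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper, not part of MS::Search::DB.
sub make_decoy {
    my ($id, $seq, %args) = @_;
    my $type   = $args{type}   // 'reverse';
    my $prefix = $args{prefix} // 'DECOY_';

    my $decoy_seq;
    if ($type eq 'reverse') {
        # reverse the residue order
        $decoy_seq = scalar reverse $seq;
    }
    elsif ($type eq 'shuffle') {
        # assumption: a simple random permutation of the residues
        $decoy_seq = join '', shuffle(split //, $seq);
    }
    else {
        die "Unknown decoy type: $type\n";
    }

    return ($prefix . $id, $decoy_seq);
}

my ($decoy_id, $decoy_seq) = make_decoy('P1', 'MKTAYIAK');
print ">$decoy_id\n$decoy_seq\n";
```

Both types preserve the length and amino acid composition of the original protein; only the residue order changes.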

write

    $db->write(
        fh => $fh,
        randomize => 1,
    );

Writes the database in FASTA format to a filehandle. Possible arguments include:

  • fh — filehandle to write to (default: STDOUT)

  • randomize — whether to randomly shuffle sequences before writing (default: 0)
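The two options can be pictured with the following minimal sketch. It is not the module's implementation: the write_fasta helper and record layout are hypothetical, and sequences are written on a single line with no wrapping, which may not match the module's output.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper, not part of MS::Search::DB.
sub write_fasta {
    my ($records, %args) = @_;
    my $fh  = $args{fh} // \*STDOUT;    # fall back to STDOUT
    my @out = $args{randomize} ? shuffle(@$records) : @$records;
    for my $rec (@out) {
        # assumption: one header line and one unwrapped sequence line
        print {$fh} ">$rec->{id}\n$rec->{seq}\n";
    }
    return scalar @out;
}

# write to an in-memory filehandle for demonstration
open my $out, '>', \my $buffer or die "Cannot open in-memory handle: $!";
write_fasta(
    [
        { id => 'P1', seq => 'MKTAYIAK' },
        { id => 'P2', seq => 'MSLLTEVE' },
    ],
    fh => $out,
);
close $out;

print $buffer;
```

Passing randomize => 1 changes only the order of the records, not their content.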

CAVEATS AND BUGS

Please report bugs or feature requests through the issue tracker at https://github.com/jvolkening/p5-MS/issues.

SEE ALSO

AUTHOR

Jeremy Volkening <jdv@base2bio.com>

COPYRIGHT AND LICENSE

Copyright 2015-2020 Jeremy Volkening

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.