NAME

Bio::FastParsers::CdHit - Front-end class for CD-HIT parser

VERSION

version 0.221230

SYNOPSIS

    use aliased 'Bio::FastParsers::CdHit';

    # open and parse CD-HIT report (cluster file)
    my $infile = 'test/cdHit.out.clstr';
    my $report = CdHit->new( file => $infile );

    # loop through representatives to get members
    for my $repr ( $report->all_representatives ) {
        my $members = $report->members_for($repr);
        # ...
    }

    # get representatives ordered by descending cluster size
    my @reprs = $report->all_representatives_by_cluster_size;

    # create IdMapper
    # Note: this requires Bio::MUST::Core
    my $mapper = $report->clust_mapper(':');
    my @long_ids = $mapper->all_long_ids;

    # ...

DESCRIPTION

This module implements a parser for the output file of the CD-HIT program. It provides methods for getting the ids of the representative sequences (either sorted by descending cluster size or not) and for obtaining the members of any cluster from the id of its representative.

It also provides a method that facilitates re-mapping the ids of every cluster onto a phylogenetic tree through a Bio::MUST::Core::IdMapper object.
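
To make the representative/member relationship concrete, here is a minimal, self-contained sketch of how a cluster file could be turned into such a mapping. It is not the module's actual implementation. It assumes the usual CD-HIT .clstr layout (a ">Cluster N" header, then one line per sequence, with the representative flagged by a trailing "*"). The sample ids and the parse_clstr helper are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample in the usual CD-HIT .clstr layout (ids are invented).
my $clstr = <<'EOT';
>Cluster 0
0 120aa, >seqA... *
1 118aa, >seqB... at 98.31%
2 115aa, >seqC... at 95.65%
>Cluster 1
0 90aa, >seqD... *
EOT

# Map each representative id to an array ref of the other ids in its cluster.
# parse_clstr is a hypothetical helper, not part of Bio::FastParsers.
sub parse_clstr {
    my ($text) = @_;
    my (%members_for, $repr, @pending);

    # Store the cluster collected so far, then reset the state.
    my $flush = sub {
        $members_for{$repr} = [ @pending ] if defined $repr;
        ($repr, @pending) = ();
    };

    for my $line (split /\n/, $text) {
        if ($line =~ m/^>Cluster/) {    # a new cluster starts here
            $flush->();
            next;
        }
        # Sequence line: '>id...' followed by '*' (representative) or 'at NN%'.
        my ($id, $star) = $line =~ m/>(.+?)\.\.\.\s+(\*)?/ or next;
        if ($star) { $repr = $id } else { push @pending, $id }
    }
    $flush->();    # do not forget the last cluster

    return \%members_for;
}

my $members_for = parse_clstr($clstr);
```

In this sketch the representative is kept out of its own member list, so a singleton cluster maps to an empty array ref.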

ATTRIBUTES

file

Path to the CD-HIT report file to be parsed.

METHODS

all_representatives

Returns all the ids of the representative sequences of the clusters (not an array reference).

    # $report is a Bio::FastParsers::CdHit
    for my $repr ( $report->all_representatives ) {
        # process $repr
        # ...
    }

This method does not accept any arguments.

all_representatives_by_cluster_size

Returns all the ids of the representative sequences of the clusters (not an array reference) sorted by descending cluster size (and then lexically by id).

    # $report is a Bio::FastParsers::CdHit
    for my $repr ( $report->all_representatives_by_cluster_size ) {
        # process $repr
        # ...
    }

This method does not accept any arguments.
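
The ordering described above (descending cluster size, then lexical order of the ids) can be sketched with a plain Perl sort. This is only an illustration under stated assumptions: the %members_for hash and its ids are hypothetical stand-ins for the parsed report, not the module's internal data structure.

```perl
use strict;
use warnings;

# Hypothetical data: representative id => array ref of member ids.
my %members_for = (
    seqD => [],
    seqA => [ qw(seqB seqC) ],
    seqX => [ qw(seqY seqZ) ],
    seqM => [ qw(seqN) ],
);

# Larger clusters come first; ties are broken by comparing ids as strings.
my @by_size = sort {
    @{ $members_for{$b} } <=> @{ $members_for{$a} }
        || $a cmp $b
} keys %members_for;

print "$_\n" for @by_size;
```

The secondary comparison on ids makes the order deterministic when several clusters have the same size, which plain hash key order would not guarantee.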

members_for

Returns all the ids of the member sequences of the cluster corresponding to the id of the specified representative (as an array reference).

    # $report is a Bio::FastParsers::CdHit
    for my $repr ( $report->all_representatives ) {
        my $members = $report->members_for($repr);
        # process $members ArrayRef
        # ...
    }

This method requires one argument: the id of the representative.

clust_mapper

Returns a Bio::MUST::Core::IdMapper object associating representative sequence ids to stringified full lists of their member sequence ids (including the representatives themselves).

This method needs Bio::MUST::Core to be installed on the computer.

    # $report is a Bio::FastParsers::CdHit
    my $mapper = $report->clust_mapper(':');

The native methods of Bio::MUST::Core::IdMapper can be called on $mapper, e.g., all_long_ids or long_id_for.

This method accepts an optional argument: the id separator (default: /).
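
As a rough illustration of what a "stringified full list" might look like, the sketch below joins each representative with its members using the chosen separator. It does not use Bio::MUST::Core. The stringify_clusters helper, the sample ids and the ordering of ids within each string (representative first) are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical data: representative id => array ref of member ids.
my %members_for = (
    seqA => [ qw(seqB seqC) ],
    seqD => [],
);

# Join each representative and its members into a single string.
# Assumption: representative first, then members in their stored order.
sub stringify_clusters {
    my ($members_for, $sep) = @_;
    $sep //= '/';    # same default separator as documented for clust_mapper

    my %long_id_for;
    for my $repr (keys %$members_for) {
        $long_id_for{$repr} = join $sep, $repr, @{ $members_for->{$repr} };
    }
    return \%long_id_for;
}

my $long_id_for = stringify_clusters(\%members_for, ':');
print "$_ => $long_id_for->{$_}\n" for sort keys %$long_id_for;
```

Choosing a separator that cannot occur inside the sequence ids themselves keeps the joined strings unambiguous if they later need to be split again.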

AUTHOR

Denis BAURAIN <denis.baurain@uliege.be>

CONTRIBUTOR

Amandine BERTRAND <amandine.bertrand@doct.uliege.be>

COPYRIGHT AND LICENSE

This software is copyright (c) 2013 by University of Liege / Unit of Eukaryotic Phylogenomics / Denis BAURAIN.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.