NAME

obogaf::parser - a perl5 module to handle obo and gaf files

SYNOPSIS

use obogaf::parser;

my ($graph, $subonto, $res, $stat, $parORchdlist, $newobo);

$graph= build_edges(obofile);

$subonto= build_subonto(edgesfile, namespace);

$stat= make_stat(edgesfile, parentIndex, childIndex);

$parORchdlist= get_parents_or_children_list(edgesfile, parentIndex, childIndex, parORchd);

$newobo= obo_filter(obofile, termsfile);

($res, $stat)= gene2biofun(annfile, geneIndex, classIndex);

($res, $stat)= map_OBOterm_between_release(obofile, annfile, classIndex);

ABSTRACT

obogaf::parser is a perl5 module designed to handle Open Biological and Biomedical Ontology (obo) files and gene association files (gaf).

DESCRIPTION

obogaf::parser is a perl5 module specifically designed to handle GO and HPO obo (Open Biological and Biomedical Ontology) files and their Gene Annotation Files (gaf files). However, all the obogaf::parser subroutines can be safely used to parse any obo file listed in OBO foundry and any gene annotation file structured like those shown on the GOA website and the HPO website (basically a csv file using tab as separator).

build_edges - extract edges from an obo file.
    my $graph= build_edges(obofile);

obofile: any obo file listed in OBO foundry. The file extension must be ".obo".

output: the graph is returned as a list of tuples of the form: subdomain <tab> source-ID <tab> destination-ID <tab> relationship <tab> source-name <tab> destination-name. In other words, the graph is returned as a list of edges, where each edge is represented as a pair of vertices in the form source <tab> destination. For each pair of nodes, the subdomain (if any), the relationship (only is_a and part_of, i.e. the relationships over which it is safe to group annotations) and the names of the source and destination obo terms are returned as well. The graph is stored as an anonymous scalar.
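To make the tuple format concrete, the sketch below extracts is_a and part_of edges from OBO text held in a string. It is a minimal, self-contained approximation and not the module's own code: the helper name edges_from_obo_text and the inline two-term sample are hypothetical, and the sketch assumes that source is the parent term and destination is the child term.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of obogaf::parser): turn OBO text into
# tab-separated tuples
#   namespace, parent-ID, child-ID, relationship, parent-name, child-name
sub edges_from_obo_text {
    my ($obo) = @_;
    my @edges;
    # Stanzas are separated by blank lines; only [Term] stanzas carry edges.
    for my $stanza (split /\n\s*\n/, $obo) {
        next unless $stanza =~ /^\[Term\]/;
        my ($id)   = $stanza =~ /^id:\s*(\S+)/m;
        my ($name) = $stanza =~ /^name:\s*(.+)$/m;
        my ($ns)   = $stanza =~ /^namespace:\s*(\S+)/m;
        next unless defined $id;
        $name //= '';
        $ns   //= '';
        for my $line (split /\n/, $stanza) {
            if ($line =~ /^is_a:\s*(\S+)\s*!\s*(.+)$/) {
                push @edges, join("\t", $ns, $1, $id, 'is_a', $2, $name);
            }
            elsif ($line =~ /^relationship:\s*part_of\s+(\S+)\s*!\s*(.+)$/) {
                push @edges, join("\t", $ns, $1, $id, 'part_of', $2, $name);
            }
        }
    }
    return \@edges;
}

# Small inline sample instead of a real .obo file.
my $sample_obo = <<'OBO';
format-version: 1.2

[Term]
id: GO:0008150
name: biological_process
namespace: biological_process

[Term]
id: GO:0009987
name: cellular process
namespace: biological_process
is_a: GO:0008150 ! biological_process
OBO

print "$_\n" for @{ edges_from_obo_text($sample_obo) };
```

Each element of the returned array reference is one edge in the six-column layout described above; the sample yields a single is_a edge from GO:0008150 to GO:0009987.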

build_subonto - extract edges of a specified sub-ontology domain.
    my $subonto= build_subonto(edgesfile, namespace);

edgesfile: a graph in the form: subdomain <tab> source <tab> destination <tab> relationship <tab> source-name <tab> destination-name. This file can be obtained by calling the subroutine build_edges. NB: to run this subroutine, the fields relationship, source-name and destination-name are optional. The field subdomain, instead, is required and must be placed in the first column; otherwise an error message is returned.

namespace: name of the subontology for which the edges must be extracted.

output: the graph is returned as a list of tuples of the form: source <tab> destination <tab> relationship. In other words, the graph is returned as a list of edges, where each edge is represented as a pair of vertices in the form source <tab> destination. For each pair of nodes, the relationship (is_a or part_of) is also returned. The graph is stored as an anonymous scalar.
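The filtering step can be pictured with a few lines of plain Perl. This is only a sketch under stated assumptions, not the module's implementation: the helper name filter_subonto is hypothetical, and the edges are passed as an array of tab-separated lines rather than read from edgesfile.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of obogaf::parser): keep the edges of one
# namespace and drop the subdomain and name columns, leaving
#   source, destination, relationship (relationship only if present).
sub filter_subonto {
    my ($lines, $namespace) = @_;
    my @kept;
    for my $line (@$lines) {
        chomp(my $copy = $line);
        my ($ns, $src, $dst, $rel) = split /\t/, $copy;
        next unless defined $ns && $ns eq $namespace;
        push @kept, join("\t", grep { defined } ($src, $dst, $rel));
    }
    return \@kept;
}

my @sample_edges = (
    join("\t", 'biological_process', 'GO:0008150', 'GO:0009987', 'is_a',
               'biological_process', 'cellular process'),
    join("\t", 'molecular_function', 'GO:0003674', 'GO:0005488', 'is_a',
               'molecular_function', 'binding'),
);

print "$_\n" for @{ filter_subonto(\@sample_edges, 'molecular_function') };
```

Because the namespace is read from the first column, an edges file without that column cannot be filtered this way, which is why the subdomain field is mandatory.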

make_stat - compute basic statistics on a graph.
    my $stat= make_stat(edgesfile, parentIndex, childIndex);

edgesfile: a graph represented as a list of edges, where each edge is stored as a pair of tab-separated vertices. This file can be obtained by calling the subroutine build_edges.

parentIndex: index of the column containing the parent (source) vertices in the edgesfile.

childIndex: index of the column containing the child (destination) vertices in the edgesfile.

output: statistics about the graph are printed on the shell. More precisely, for each vertex of the graph, the degree, in-degree and out-degree are printed. The vertices are sorted in decreasing order of degree. Finally, the following statistics are returned as well: 1) number of nodes and edges of the graph; 2) maximum and minimum degree; 3) average and median degree; 4) density of the graph.
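The statistics listed above can be reproduced on a toy graph with the following sketch. It is not the module's code: the helper name degree_stats is hypothetical, the edges are given as [parent, child] pairs instead of a file, and density is computed as E / (V * (V - 1)), the usual definition for a directed graph, which may differ from the formula the module uses.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of obogaf::parser): degree statistics for
# a list of [parent, child] pairs.
sub degree_stats {
    my ($edges) = @_;
    die "no edges given\n" unless @$edges;
    my (%in, %out, %node);
    for my $e (@$edges) {
        my ($parent, $child) = @$e;
        $out{$parent}++;    # the edge leaves the parent (source)
        $in{$child}++;      # the edge enters the child (destination)
        $node{$_} = 1 for $parent, $child;
    }
    my %deg = map { $_ => ($in{$_} // 0) + ($out{$_} // 0) } keys %node;
    my @sorted = sort { $b <=> $a } values %deg;    # decreasing degree
    my $n   = @sorted;
    my $m   = @$edges;
    my $sum = 0;
    $sum += $_ for @sorted;
    my $median = $n % 2
        ? $sorted[ ($n - 1) / 2 ]
        : ($sorted[ $n / 2 - 1 ] + $sorted[ $n / 2 ]) / 2;
    return {
        nodes   => $n,
        edges   => $m,
        max     => $sorted[0],
        min     => $sorted[-1],
        average => $sum / $n,
        median  => $median,
        density => $n > 1 ? $m / ($n * ($n - 1)) : 0,
        degree  => \%deg,
    };
}

my $s = degree_stats([ [ 'A', 'B' ], [ 'A', 'C' ], [ 'B', 'D' ] ]);
printf "%d nodes, %d edges, density %.2f\n", @{$s}{qw(nodes edges density)};
# prints "4 nodes, 3 edges, density 0.25"
```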

get_parents_or_children_list - build parents or children list for each node of the graph.
    my $parORchdlist= get_parents_or_children_list(edgesfile, parentIndex, childIndex, parORchd);

edgesfile: a graph represented as a list of edges, where each edge is stored as a pair of tab-separated vertices. This file can be obtained by calling the subroutine build_edges.

parentIndex: index of the column containing the parent (source) vertices in the edgesfile.

childIndex: index of the column containing the child (destination) vertices in the edgesfile.

parORchd: must be either parents or children. If parORchd is parents, a pipe-separated list containing the parents of each node of the graph is returned; if parORchd is children, a pipe-separated list containing the children of each node is returned.

output: an anonymous hash storing for each node of the graph the list of its children or parents according to the parORchd parameter.
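The shape of the returned hash can be illustrated as follows. This is a sketch only, not the module's implementation: the helper name neighbours_list is hypothetical and the edges are given as [parent, child] pairs rather than read from edgesfile.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of obogaf::parser): for every node, build a
# pipe-separated list of its parents or of its children.
sub neighbours_list {
    my ($edges, $which) = @_;
    die "second argument must be parents or children\n"
        unless $which eq 'parents' || $which eq 'children';
    my %list;
    for my $e (@$edges) {
        my ($parent, $child) = @$e;
        # Key on the child to collect parents, on the parent to collect children.
        my ($key, $val) = $which eq 'parents'
            ? ($child, $parent)
            : ($parent, $child);
        push @{ $list{$key} }, $val;
    }
    return { map { $_ => join('|', @{ $list{$_} }) } keys %list };
}

my @toy_edges = ([ 'A', 'B' ], [ 'A', 'C' ], [ 'B', 'C' ]);
my $parents   = neighbours_list(\@toy_edges, 'parents');
print "$_\t$parents->{$_}\n" for sort keys %$parents;
```

In this sketch, nodes without any parent (or without any child) simply do not appear as keys of the hash.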

obo_filter - prune an obo file.
    $newobo= obo_filter(obofile, termsfile);

obofile: any obo file listed in OBO foundry. The file extension must be ".obo".

termsfile: file containing the set of obo terms (newline-separated) to which the obo file must be restricted.

output: an anonymous scalar storing, in obo format, the terms listed in termsfile.
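Pruning can be pictured as keeping only the [Term] stanzas whose id is in the requested set. The sketch below is not the module's code: the helper name prune_obo_text is hypothetical, the OBO content is passed as a string, and keeping the header and any non-[Term] stanzas untouched is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of obogaf::parser): keep the header and the
# [Term] stanzas whose id appears in the hash reference $keep.
sub prune_obo_text {
    my ($obo, $keep) = @_;
    my @out;
    for my $stanza (split /\n\s*\n/, $obo) {
        (my $s = $stanza) =~ s/\s+\z//;    # drop trailing whitespace
        if ($s =~ /^\[Term\]/) {
            my ($id) = $s =~ /^id:\s*(\S+)/m;
            next unless defined $id && $keep->{$id};
        }
        push @out, $s;
    }
    return join("\n\n", @out) . "\n";
}

my $obo_text = <<'OBO';
format-version: 1.2

[Term]
id: GO:0008150
name: biological_process

[Term]
id: GO:0009987
name: cellular process
is_a: GO:0008150 ! biological_process
OBO

# Keep only one of the two terms.
print prune_obo_text($obo_text, { 'GO:0008150' => 1 });
```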

gene2biofun - make annotations adjacency list.
    my ($res, $stat)= gene2biofun(annfile, geneIndex, classIndex);

annfile: an annotation file. The file can be either plain text (".txt") or compressed (".gz"). An example of the format of this file can be taken from the GOA website (file with ".gaf.gz" extension) or the HPO website. More generally, any file structured like those mentioned above can be used (basically a ".csv" file using <tab> as separator).

geneIndex: index referring to the column containing the samples (genes/proteins).

classIndex: index referring to the column containing the ontology terms.

output: a list of two anonymous references. The first is an anonymous hash storing for each gene (or protein) all the associated ontology terms (pipe-separated). The second is an anonymous scalar containing basic statistics, such as the total number of unique genes/proteins and of annotated ontology terms.
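The two return values can be mimicked on a handful of annotation lines as shown below. This is a sketch, not the module's implementation: the helper name annotation_list and the sample gene names are made up, column indices are assumed to be zero-based, and lines starting with "!" or "#" are assumed to be comments to skip.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of obogaf::parser): build a gene => terms
# adjacency list plus a short statistics string.
sub annotation_list {
    my ($lines, $gene_idx, $class_idx) = @_;
    my (%seen, %terms_of, %all_terms);
    for my $line (@$lines) {
        next if $line =~ /^[!#]/;    # skip comment/header lines
        chomp(my $copy = $line);
        my @f = split /\t/, $copy;
        my ($gene, $term) = @f[ $gene_idx, $class_idx ];
        next unless defined $gene && defined $term;
        next if $seen{$gene}{$term}++;    # ignore repeated gene-term pairs
        push @{ $terms_of{$gene} }, $term;
        $all_terms{$term} = 1;
    }
    my %res  = map { $_ => join('|', @{ $terms_of{$_} }) } keys %terms_of;
    my $stat = sprintf "genes: %d\nontology terms: %d\n",
        scalar(keys %res), scalar(keys %all_terms);
    return (\%res, \$stat);
}

# Made-up two-column sample: gene <tab> ontology term.
my @ann = (
    "!sample header line",
    "geneA\tGO:0008150",
    "geneA\tGO:0009987",
    "geneB\tGO:0008150",
);
my ($res, $stat) = annotation_list(\@ann, 0, 1);
print "$_\t$res->{$_}\n" for sort keys %$res;
print ${$stat};
```

Note that the second return value is a reference to a scalar, so it has to be dereferenced (${$stat}) before printing.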

map_OBOterm_between_release - map ontology terms between different releases.
    my ($res, $stat)= map_OBOterm_between_release(obofile, annfile, classIndex);

obofile: an obo file (a new release). This file is used to make the alt_id - id pairing, by using alt_id as key. The file extension must be ".obo".

annfile: an annotation file (an old release). The file can be either plain text (".txt") or compressed (".gz").

classIndex: index referring to the column of the annfile containing the ontology terms to be mapped.

output: a list of two anonymous references. The first is an anonymous scalar storing the annotation file in the same format as the input file, but with the obsolete ontology terms replaced by the updated ones. The second is an anonymous scalar containing some basic statistics, such as the total number of unique ontology terms and the total number of mapped and unmapped alt_id ontology terms. Finally, all the alt_id - id pairs found (if any) are returned.
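The alt_id - id pairing described above can be sketched in two steps: build a lookup keyed on alt_id from the new obo release, then rewrite the term column of the old annotations. The code below is an illustration under stated assumptions, not the module's implementation: the helper names altid_map and remap_terms, the EX: identifiers and the gene names are all made up, and the column index is assumed to be zero-based.

```perl
use strict;
use warnings;

# Hypothetical helper: map every alt_id found in OBO text to its primary id.
sub altid_map {
    my ($obo) = @_;
    my %map;
    for my $stanza (split /\n\s*\n/, $obo) {
        next unless $stanza =~ /^\[Term\]/;
        my ($id) = $stanza =~ /^id:\s*(\S+)/m;
        next unless defined $id;
        $map{$_} = $id for $stanza =~ /^alt_id:\s*(\S+)/mg;
    }
    return \%map;
}

# Hypothetical helper: replace alt_ids in column $class_idx by primary ids
# and count how many annotation lines were changed.
sub remap_terms {
    my ($lines, $class_idx, $map) = @_;
    my $mapped = 0;
    my @out;
    for my $line (@$lines) {
        chomp(my $copy = $line);
        my @f = split /\t/, $copy, -1;    # -1 keeps trailing empty fields
        if (defined $f[$class_idx] && exists $map->{ $f[$class_idx] }) {
            $f[$class_idx] = $map->{ $f[$class_idx] };
            $mapped++;
        }
        push @out, join("\t", @f);
    }
    return (\@out, $mapped);
}

# Made-up new release: EX:0000001 has been merged into EX:0000002.
my $new_obo = <<'OBO';
[Term]
id: EX:0000002
name: example term
alt_id: EX:0000001
OBO

my @old_ann = ("geneA\tEX:0000001", "geneB\tEX:0000002");
my ($new_ann, $n_mapped) = remap_terms(\@old_ann, 1, altid_map($new_obo));
print "$_\n" for @$new_ann;
print "mapped: $n_mapped\n";
```

Terms that are neither a primary id nor an alt_id in the new release are left unchanged by this sketch.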

BUGS

Please report any bugs here.

COPYRIGHT

Copyright (C) 2019 Marco Notaro, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.

AUTHOR

Marco Notaro (https://marconotaro.github.io)

SEE ALSO

A step-by-step tutorial showing how to apply obogaf::parser to real biomedical case studies is available at the following link https://obogaf-parser.readthedocs.io.