The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Bio::EBI::RNAseqAPI - A Perl interface to the EMBL-EBI RNA-seq analysis API.

DESCRIPTION

This module provides a Perl-based interface to the EMBL-EBI RNA-seq analysis API.

The RNA-seq Analysis API enables access to analysis results for thousands of publicly available gene expression datasets. This module provides functions to access each endpoint provided by the API.

For more information about the API, see its documentation.

SYNOPSIS

 use 5.10.0;
 use Bio::EBI::RNAseqAPI;

 my $rnaseqAPI = Bio::EBI::RNAseqAPI->new;

 my $runInfo = $rnaseqAPI->get_runs_by_study(
    study => "E-MTAB-513", 
    minimum_mapped_reads => 0
 );

ATTRIBUTES

run_organism_list

An anonymous hash containing allowed organism names for downloading run information as keys. Access the contents like this:

 my $runOrganisms = $rnaseqAPI->get_run_organism_list;
expression_organism_list

An anonymous hash containing allowed organism names for downloading gene expression information as keys. Access the contents like this:

 my $expressionOrganisms = $rnaseqAPI->get_expression_organism_list;

METHODS

Analysis results per sequencing run

These functions take arguments in the form of a hash. These usually consist of a study accession, or one or more run accessions, plus a value for "minimum_mapped_reads". This value represents the minimum percentage of mapped reads to allow for each run in the results. Only information for runs with a percentage of mapped reads greater than or equal to this value will be returned. To get all available information, set "minimum_mapped_reads" to zero.

Analysis information for each run is returned in an anonymous hash. Some functions return anonymous arrays with one anonymous hash per run found. See below for examples and more information about the results.

get_run

Accesses the API's getRun JSON endpoint and returns analysis information for a single run, passed in the arguments.

Arguments should be passed as a hash containing values for "run" and "minimum_mapped_reads", e.g.:

 my $runInfo = $rnaseqAPI->get_run(
    run => "ERR030885",
    minimum_mapped_reads => 0
 );

Run analysis information is returned in an anonymous hash. Returns undef (and logs errors) if errors are encountered.

An example of the hash returned is as follows:

 {
     'BIOREP_ID' => 'ERR030885',
     'RUN_IDS' => 'ERR030885',
     'REFERENCE_ORGANISM' => 'homo_sapiens',
     'MAPPING_QUALITY' => 96,
     'ASSEMBLY_USED' => 'GRCh38',
     'ORGANISM' => 'homo_sapiens',
     'BEDGRAPH_LOCATION' => 'ftp://ftp.ebi.ac.uk/pub/databases/arrayexpress/data/atlas/rnaseq/ERR030/ERR030885/ERR030885.bedgraph',
     'ENA_LAST_UPDATED' => 'Mon Aug 18 2014 13:40:46',
     'STUDY_ID' => 'ERP000546',
     'BIGWIG_LOCATION' => 'ftp://ftp.ebi.ac.uk/pub/databases/arrayexpress/data/atlas/rnaseq/ERR030/ERR030885/ERR030885.bw',
     'CRAM_LOCATION' => 'ftp://ftp.ebi.ac.uk/pub/databases/arrayexpress/data/atlas/rnaseq/ERR030/ERR030885/ERR030885.cram',
     'LAST_PROCESSED_DATE' => 'Sun Jul 12 2015 23:31:47',
     'STATUS' => 'Complete',
     'SAMPLE_IDS' => 'SAMEA962348'
 }
get_runs_by_list

This function takes an anonymous array of run accessions and sequentially accesses the API's getRun JSON endpoint to collect the analysis information for each run in the list provided.

 my $runInfo = $rnaseqAPI->get_runs_by_list(
    runs => [ "ERR030885", "ERR030886" ],
    minimum_mapped_reads => 0
 );

Run analysis information is returned as an anonymous array containing one anonymous hash per run (see "get_run" documentation for an example of what the anonymous hash looks like). Returns undef (and logs errors) if errors are encountered.

get_runs_by_study

Accesses the API's getRunsByStudy JSON endpoint, and returns an anonymous array containing an anonymous hash for each run found (see "get_run" docs for an example).

 my $runInfo = $rnaseqAPI->get_runs_by_study(
    study => "E-MTAB-513",
    minimum_mapped_reads => 0
 );

Study accession can be either an ENA, SRA, DDBJ or ArrayExpress study accession. The example above uses an ArrayExpress experiment accession. Examples of ENA, SRA or DDBJ accessions are "ERP000546" or "SRP013533" or "DRP000391", respectively.

Returns undef (and logs errors) if errors are encountered.

get_runs_by_organism

Accesses the API's getRunsByOrganism JSON endpoint, and returns an anonymous array containing an anonymous hash for each run found.

 my $runInfo = $rnaseqAPI->get_runs_by_organism(
    organism => "homo_sapiens",
    minimum_mapped_reads => 70
 );

Value for "organism" attribute is a species scientific name, in lower case, with underscores instead of spaces. E.g. "homo_sapiens", "canis_lupus_familiaris", "oryza_sativa_japonica_group". To ensure your organism name is allowed, check against the "run_organism_list" attribute:

 my $organism = "oryctolagus_cuniculus";
 my $organismList = $rnaseqAPI->get_run_organism_list;
 if( $organismList->{ $organism } ) {
     say "Found $organism!";
 }

Results are returned as an anonymous array containing an anonymous hash for each run found. Returns undef (and logs errors) if errors are encountered.

get_runs_by_organism_condition

Accesses the API's getRunsByOrganismCondition JSON endpoint, and returns an anonymous array containing an anonymous hash for each run found. An organism name and a "condition" -- meaning a sample attribute -- are passed in the arguments. The condition must exist in the Experimental Factor Ontology (EFO); this can be checked via the EFO website or via the Ontology Lookup Service (OLS) API.

 my $runInfo = $rnaseqAPI->get_runs_by_organism_condition(
    organism => "homo_sapiens",
    condition => "central nervous system",
    minimum_mapped_reads => 70
 );

See "get_runs_by_organism" docs for how to check organism name format and availability.

Returns undef (and logs errors) if errors are encountered.

Analysis results per study

These functions take an ENA, SRA, DDBJ or ArrayExpress accession or species name and return information about the corresponding dataset(s).

get_study

Accesses the API's getStudy JSON endpoint. Single argument is a study accession (ENA, SRA, DDBJ or ArrayExpress). Returns an anonymous hash containing the results for the matching study. Returns undef (and logs errors) if errors are encountered. If you try an ArrayExpress accession and it doesn't work, try the corresponding sequencing archive study accession instead.

 my $studyInfo = $rnaseqAPI->get_study( "SRP033494" );

An example of the anonymous hash returned is as follows:

 {
     'GTF_USED' => 'Arabidopsis_thaliana.TAIR10.26.gtf.gz',
     'ORGANISM' => 'arabidopsis_thaliana',
     'STATUS' => 'Complete',
     'ASSEMBLY_USED' => 'TAIR10',
     'LAST_PROCESSED_DATE' => 'Thu Jun 30 2016 19:55:56',
     'GENES_FPKM_COUNTS_FTP_LOCATION' => 'ftp://ftp.ebi.ac.uk/pub/databases/arrayexpress/data/atlas/rnaseq/studies/ena/SRP033494/genes.rpkm.tsv',
     'EXONS_FPKM_COUNTS_FTP_LOCATION' => 'ftp://ftp.ebi.ac.uk/pub/databases/arrayexpress/data/atlas/rnaseq/studies/ena/SRP033494/exons.rpkm.tsv',
     'GENES_TPM_COUNTS_FTP_LOCATION' => 'ftp://ftp.ebi.ac.uk/pub/databases/arrayexpress/data/atlas/rnaseq/studies/ena/SRP033494/genes.tpm.tsv',
     'SOFTWARE_VERSIONS_FTP_LOCATION' => 'ftp://ftp.ebi.ac.uk/pub/databases/arrayexpress/data/atlas/rnaseq/studies/ena/SRP033494/irap.versions.tsv',
     'REFERENCE_ORGANISM' => 'arabidopsis_thaliana',
     'STUDY_ID' => 'SRP033494',
     'EXONS_RAW_COUNTS_FTP_LOCATION' => 'ftp://ftp.ebi.ac.uk/pub/databases/arrayexpress/data/atlas/rnaseq/studies/ena/SRP033494/exons.raw.tsv',
     'GENES_RAW_COUNTS_FTP_LOCATION' => 'ftp://ftp.ebi.ac.uk/pub/databases/arrayexpress/data/atlas/rnaseq/studies/ena/SRP033494/genes.raw.tsv',
     'EXONS_TPM_COUNTS_FTP_LOCATION' => 'ftp://ftp.ebi.ac.uk/pub/databases/arrayexpress/data/atlas/rnaseq/studies/ena/SRP033494/exons.tpm.tsv'
 } 
get_studies_by_organism

Accesses the API's getStudiesByOrganism JSON endpoint. Single argument is the name of an organism (see the "run_organism_list" attribute for allowed names). Returns an anonymous array containing one anonymous hash per study found. See "get_study" docs for an example of an anonymous hash.

 my $studies = $rnaseqAPI->get_studies_by_organism( "arabidopsis_thaliana" );

Sample attributes per run

These functions return information about the sample attributes that runs are annotated with. Sample attributes have a "type", e.g. "organism", and a "value", e.g. "Homo sapiens". Where possible, the URI of the matching ontology term is also annotated.

get_sample_attributes_by_run

Accesses the API's getSampleAttributesByRun JSON endpoint. Single argument is the accession of the run. Returns an anonymous array containing one anonymous hash per sample attribute found.

 my $sampleAttributes = $rnaseqAPI->get_sample_attributes_by_run( "SRR805786" );

An example of the results returned is as follows:

 [
    {
        'VALUE' => 'peripheral blood mononuclear cells (PBMCs)',
        'EFO_URL' => 'NA',
        'STUDY_ID' => 'SRP020492',
        'TYPE' => 'cell type',
        'RUN_ID' => 'SRR805786'
    },
    {
        'STUDY_ID' => 'SRP020492',
        'TYPE' => 'organism',
        'RUN_ID' => 'SRR805786',
        'VALUE' => 'Homo sapiens',
        'EFO_URL' => 'http://purl.obolibrary.org/obo/NCBITaxon_9606'
    },
 ]

EFO_URL is not always present and will be "NA" if it is not.

Returns undef (and logs errors) if errors are encountered.

get_sample_attributes_per_run_by_study

Accesses the API's getSampleAttributesPerRunByStudy JSON endpoint. Single argument is a study accession. Returns an array ref containing one anonymous hash per sample attribute. See "get_sample_attributes_by_run" docs for an example. Returns undef (and logs errors) if errors are encountered.

 my $sampleAttributes = $rnaseqAPI->get_sample_attributes_per_run_by_study( "DRP000391" );
get_sample_attributes_coverage_by_study

Accesses the API's getSampleAttributesCoverageByStudy endpoint. Single argument is a study accession. Returns an anonymous array containing one anonymous hash per sample attribute. Returns undef (and logs errors) if errors are encountered.

 my $sampleAttributeCoverage = $rnaseqAPI->get_sample_attributes_coverage_by_study( "DRP000391" );

An example of the results is as follows:

 [
    {
        'VALUE' => 'Nipponbare',
        'STUDY_ID' => 'DRP000391',
        'PCT_OF_ALL_RUNS' => 100,
        'NUM_OF_RUNS' => 28,
        'TYPE' => 'cultivar'
    },
    {
        'VALUE' => 'Oryza sativa Japonica Group',
        'STUDY_ID' => 'DRP000391',
        'PCT_OF_ALL_RUNS' => 100,
        'NUM_OF_RUNS' => 28,
        'TYPE' => 'organism'
    },
    {
        'VALUE' => '7 days after germination',
        'STUDY_ID' => 'DRP000391',
        'PCT_OF_ALL_RUNS' => 29,
        'NUM_OF_RUNS' => 8,
        'TYPE' => 'developmental stage'
    }
 ]

Baseline gene expression per tissue, cell type, developmental stage, sex, and strain

get_expression_by_organism_genesymbol

Accesses the API's getExpression endpoint. Provide arguments as a hash, passing an organism name and a gene symbol, as well as a value for the minimum percentage of mapped reads to allow:

 my $geneExpressionInfo = $rnaseqAPI->get_expression(
    minimum_mapped_reads => 0,
    organism    => "oryza_sativa",
    gene_symbol => "BURP7"
 );

Results are returned as an anonymous array of anonymous hashes, with one anonymous hash per unique combination of tissue, cell type, developmental stage, sex, and strain. The median expression level of all runs is given in TPM (transcripts per million). Returns undef (and logs errors) if errors are encountered.

An example of the results returned is as follows:

 [
    {
        'COEFFICIENT_OF_VARIATION' => '0.3',
        'STRAIN' => 'NA',
        'DEVELOPMENTAL_STAGE' => 'seedling, two leaves visible, three leaves visible',
        'CELL_TYPE' => 'NA',
        'SEX' => 'NA',
        'GENE_ID' => 'OS05G0217700',
        'MEDIAN_EXPRESSION' => '831.1',
        'NUMBER_OF_RUNS' => 60,
        'ORGANISM' => 'oryza_sativa',
        'ALL_SAMPLE_ATTRIBUTES' => 'http://www.ebi.ac.uk/fg/rnaseq/api/tsv/getSampleAttributesByCondition/3238',
        'ORGANISM_PART' => 'shoot, vascular leaf'
    },
    {
        'STRAIN' => 'NA',
        'DEVELOPMENTAL_STAGE' => '20 days after sowing',
        'COEFFICIENT_OF_VARIATION' => '0.3',
        'CELL_TYPE' => 'NA',
        'GENE_ID' => 'OS05G0217700',
        'SEX' => 'NA',
        'ORGANISM' => 'oryza_sativa',
        'NUMBER_OF_RUNS' => 4,
        'MEDIAN_EXPRESSION' => '433.5',
        'ALL_SAMPLE_ATTRIBUTES' => 'http://www.ebi.ac.uk/fg/rnaseq/api/tsv/getSampleAttributesByCondition/3192',
        'ORGANISM_PART' => 'leaf'
    },
get_expression_by_gene_id

Accesses the API's getExpression endpoint, but instead of querying by organism and gene symbol (see "get_expression_by_organism_genesymbol"), this function queries by gene identifier. Also expects a value for the minimum percentage of mapped reads to allow.

 my $geneExpressionInfo = $rnaseqAPI->get_expression(
    gene_identifer  => "ENSG00000172023",
    minimum_mapped_reads => 0
 );

Results are returned as an anonymous array of anonymous hashes, with one anonymous hash per unique combination of tissue, cell type, developmental stage, sex, and strain. See "get_expression_by_organism_genesymbol" for an example. The median expression level of all runs is given in TPM (transcripts per million). Returns undef (and logs errors) if errors are encountered.

AUTHOR

Maria Keays <mkeays@cpan.org>

The above email should be used for feedback about the Perl module only. All mail regarding the RNA-seq analysis API itself should be directed to <rnaseq@ebi.ac.uk>.