CDMI Package

Sapling Database Access Methods

Introduction

The CDMI database represents an instance of the Kbase Central Data Model. This object has minimal capabilities: most of its power comes from the ERDB base class.

The fields in this object are as follows.

loadDirectory

Name of the directory containing the key load files.

tuning

Reference to a hash of tuning parameters.

Configuration and Construction

The database is governed by tuning parameters in an XML configuration file. The file name should be CdmiConfig.xml in the load directory. The tuning parameters affect the way the data is loaded. They are specified as attributes of the TuningParameters element, as follows.

maxLocationLength

The maximum number of base pairs allowed in a single location. IsLocatedIn records are split into sections based on this length. As a result, when you are looking for all the features in a particular neighborhood, you only need to look for locations within the maximum location distance from that neighborhood. Even a huge operon containing tens of thousands of base pairs will still be found.

maxSequenceLength

The maximum number of base pairs allowed in a single DNA sequence. DNA sequences are broken into segments to prevent excessively large genomes from clogging memory during sequence resolution.
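
The splitting that these two parameters control can be sketched as follows. This is a minimal illustration, not the CDMI loader's code: the helper names split_location and split_sequence are hypothetical, the region is assumed to lie on the forward strand and to be given as a 1-based start plus a length, and the limits used in the example are arbitrary.

```perl
use strict;
use warnings;

# Hypothetical helper: split a forward-strand region (1-based start, length)
# into pieces no longer than $maxLen base pairs.
sub split_location {
    my ($start, $length, $maxLen) = @_;
    my @pieces;
    while ($length > 0) {
        my $pieceLen = ($length > $maxLen) ? $maxLen : $length;
        push @pieces, [$start, $pieceLen];
        $start  += $pieceLen;
        $length -= $pieceLen;
    }
    return @pieces;
}

# Hypothetical helper: break a DNA string into segments of at most $maxLen
# characters. The last segment may be shorter than the others.
sub split_sequence {
    my ($dna, $maxLen) = @_;
    return $dna =~ /.{1,$maxLen}/g;
}

# Arbitrary example values, chosen only to show the splitting.
my @pieces   = split_location(1001, 9000, 4000);
my @segments = split_sequence('acgtacgtac', 4);
print join(', ', map { "$_->[0]+$_->[1]" } @pieces), "\n";
print join(' ', @segments), "\n";
```

Because no piece exceeds the limit, a search for features near a given position only has to look back one maximum length from that position.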

Loading

Unlike a normal ERDB database, the CDMI is loaded in sections, usually one genome at a time, rather than in a massive full-database load. The standard load support is therefore not present.

Tuning Parameter Defaults

Each tuning parameter must have a default value, in case it is not present in the XML configuration file. The defaults are specified in a constant hash reference called TUNING_DEFAULTS.
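
A constant hash reference of this kind, and the way configured values could be laid over it, might look like the sketch below. The numeric values are placeholders rather than the module's real defaults, and merge_tuning is a hypothetical helper introduced only for illustration.

```perl
use strict;
use warnings;

# Placeholder values: the real defaults are not stated in this document.
use constant TUNING_DEFAULTS => {
    maxLocationLength => 4000,
    maxSequenceLength => 1000000,
};

# Hypothetical helper: overlay the configured values on the defaults, so
# that any parameter missing from the configuration keeps its default.
sub merge_tuning {
    my ($configured) = @_;
    return { %{ TUNING_DEFAULTS() }, %{ $configured || {} } };
}

my $tuning = merge_tuning({ maxLocationLength => 2500 });
print "$_ = $tuning->{$_}\n" for sort keys %$tuning;
```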

new

    my $cdmi = CDMI->new(%options);

Construct a new CDMI object. The following options are supported.

loadDirectory

Data directory to be used by the loaders. The default is /var/kbase/cdm.

DBD

XML database definition file. The default is taken from the CDMIDBD environment variable, or KSaplingDBD.xml in the load directory if the environment variable is not set.

dbName

Name of the database to use. The default is kbase_sapling.

sock

Socket for accessing the database. The default is the system default.

userData

Name and password used to log on to the database, separated by a slash. The default is a user name of seed and no password.

dbhost

Database host name. The default is localhost.

port

MySQL port number to use (MySQL only). The default is 3306.

dbms

Database management system to use (e.g. postgres). The default is mysql.

uuid

Data::UUID object for generating annotation IDs. It is not created unless it is needed.

develop

If TRUE, then the development database will be used. The development database is located on a different server with a different DBD. This option overrides dbhost, externalDBD, dbName, and DBD.
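
The documented defaults can be summarized in a short sketch. This is not the actual constructor: resolve_options is a hypothetical helper, the default userData is written here as seed/ to represent a user name of seed with no password, and sock is left undefined to stand for the system default.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the documented defaults. It only resolves
# option values and does not connect to any database.
sub resolve_options {
    my (%options) = @_;
    my %resolved = (
        loadDirectory => '/var/kbase/cdm',
        dbName        => 'kbase_sapling',
        userData      => 'seed/',
        dbhost        => 'localhost',
        port          => 3306,
        dbms          => 'mysql',
        %options,
    );
    # DBD: explicit option first, then the CDMIDBD environment variable,
    # then KSaplingDBD.xml in the load directory.
    $resolved{DBD} ||= $ENV{CDMIDBD}
        || "$resolved{loadDirectory}/KSaplingDBD.xml";
    # userData is "name/password"; the password part may be empty.
    my ($user, $pass) = split m{/}, $resolved{userData}, 2;
    @resolved{qw(user pass)} = ($user, defined $pass ? $pass : '');
    return \%resolved;
}

# Example host and credentials are made up for illustration.
my $opts = resolve_options(dbhost => 'db.example.org', userData => 'alice/secret');
print "$opts->{user}\@$opts->{dbhost}:$opts->{port} using $opts->{DBD}\n";
```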

new_for_script

    my $cdmi = CDMI->new_for_script(%options);

Construct a new CDMI object for a command-line script. This method calls "GetOptions" in Getopt::Long to parse the command-line options, passing the incoming options through as its parameters. The following command-line options (all of which are optional) will also be processed by this method and used to construct the CDMI object.

If the command-line parse fails, an undefined value will be returned rather than a CDMI object.

loadDirectory

Data directory to be used by the loaders.

DBD

XML database definition file.

dbName

Name of the database to use.

sock

Socket for accessing the database.

userData

Name and password used to log on to the database, separated by a slash.

dbhost

Database host name.

port

MySQL port number to use (MySQL only).

dbms

Database management system to use (e.g. postgres, default mysql).

develop

If specified, then the development database will be used. This database is located on a different server with a different DBD. The develop option overrides dbhost, dbName, and DBD, and forces use of an external DBD.
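
The option parsing described above can be sketched with Getopt::Long directly. This is an illustration under assumptions, not the body of new_for_script: parse_cdmi_options is a hypothetical helper, the option types (string, integer, flag) are guesses based on the descriptions, and GetOptionsFromArray is used instead of GetOptions so that the example does not touch @ARGV.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: parse the documented options out of an argument list.
# Returns a hash reference, or undef if the parse fails.
sub parse_cdmi_options {
    my ($argv) = @_;
    my %opts;
    my $ok = GetOptionsFromArray(
        $argv, \%opts,
        'loadDirectory=s', 'DBD=s', 'dbName=s', 'sock=s',
        'userData=s', 'dbhost=s', 'port=i', 'dbms=s', 'develop',
    );
    return $ok ? \%opts : undef;
}

# Made-up arguments for illustration; non-option arguments stay in @args.
my @args = ('--dbhost=db.example.org', '--port', '3307', '--develop', 'genomes.tbl');
my $opts = parse_cdmi_options(\@args);
if ($opts) {
    print "host=$opts->{dbhost} port=$opts->{port}\n";
    print "remaining arguments: @args\n";
}
```

Returning undef on a failed parse matches the behavior documented for new_for_script, so a calling script can test the result before using it.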

Public Methods

ComputeTaxonID

    my $taxID = $cdmi->ComputeTaxonID($scientificName);

Compute the best-match taxonomy ID for a genome with the specified scientific name. An attempt will be made to match to the strain and then the genus and species. If no match is found, an undefined value will be returned.

scientificName

Scientific name of the genome whose taxonomy ID is desired.

RETURN

Returns the ID of the best taxonomic grouping at which to attach the named genome, or undef if no such grouping can be found.

GetLocations

    my @locs = $cdmi->GetLocations($fid);

Return the locations of the DNA for the specified feature.

fid

ID of the feature whose location is desired.

RETURN

Returns a list of BasicLocation objects for the locations containing the feature's DNA.

GenesInRegion

    my @pegs = $cdmi->GenesInRegion($location);

Return a list of the IDs for the features that overlap the specified region on a contig.

location

Location of interest, either in the form of a location string (e.g. 360108.3:NZ_AANK01000002_264528_264007) or a BasicLocation object.

RETURN

Returns a list of feature IDs. The features in the list will be all those that overlap or occur inside the location of interest.
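
The overlap test behind this method can be illustrated with a small sketch. It does not use BasicLocation or the database: parse_location and overlaps are hypothetical helpers, and the sketch assumes that a location string has the form contig_begin_end, with a begin value greater than the end value marking the minus strand.

```perl
use strict;
use warnings;

# Hypothetical parser. ASSUMPTION: the string is contig_begin_end, and
# begin > end marks the minus strand.
sub parse_location {
    my ($string) = @_;
    my ($contig, $beg, $end) = $string =~ /^(.+)_(\d+)_(\d+)$/
        or return;
    my ($left, $right) = $beg <= $end ? ($beg, $end) : ($end, $beg);
    return {
        contig => $contig,
        dir    => ($beg <= $end ? '+' : '-'),
        left   => $left,
        right  => $right,
        length => $right - $left + 1,
    };
}

# Two regions overlap when they are on the same contig and their
# left-to-right intervals intersect, regardless of strand.
sub overlaps {
    my ($x, $y) = @_;
    return $x->{contig} eq $y->{contig}
        && $x->{left} <= $y->{right}
        && $y->{left} <= $x->{right};
}

my $region = parse_location('360108.3:NZ_AANK01000002_264528_264007');
print "$region->{contig} $region->{dir} ",
      "$region->{left}-$region->{right} ($region->{length} bp)\n";
```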

ComputeDNA

    my $dna = $cdmi->ComputeDNA($contig, $beg, $dir, $length);

Return the DNA sequence for the specified location.

contig

The ID of the contig containing the desired DNA.

beg

Location of the first desired base pair.

dir

+ for the plus strand and - for the minus strand.

length

Number of base pairs.

RETURN

Returns a string containing the desired DNA. The DNA comes back entirely in lower case.
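
The meaning of the four parameters can be shown with a sketch that works on a contig sequence held in memory. This is not how the method itself retrieves DNA (it reads from the database): compute_dna is a hypothetical helper, and the sketch assumes that on the minus strand beg is the rightmost position and the result is the reverse complement of the covered bases.

```perl
use strict;
use warnings;

# Hypothetical helper working on an in-memory contig sequence.
# ASSUMPTION: positions are 1-based; on the minus strand the region runs
# leftward from $beg and is returned as a reverse complement.
sub compute_dna {
    my ($contigSeq, $beg, $dir, $length) = @_;
    my $left = ($dir eq '+') ? $beg : $beg - $length + 1;
    my $dna  = lc substr($contigSeq, $left - 1, $length);
    if ($dir eq '-') {
        $dna = scalar reverse $dna;
        $dna =~ tr/acgt/tgca/;
    }
    return $dna;
}

# Made-up ten-base sequence for illustration.
my $contigSeq = 'ATGCCGTAAG';
print compute_dna($contigSeq, 2, '+', 4), "\n";
print compute_dna($contigSeq, 5, '-', 4), "\n";
```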

Taxonomy

    my @taxonomy = $cdmi->Taxonomy($genomeID, $format);

Return the full taxonomy of the specified genome, starting from the domain downward.

genomeID

ID of the genome whose taxonomy is desired.

format (optional)

Format of the taxonomy. names will return primary names, numbers will return taxonomy numbers, and both will return taxonomy number followed by primary name. The default is names.

RETURN

Returns a list of taxonomy names, starting from the domain and moving down to the node where the genome is attached.

ComputeNewAnnotationID

    my $annotationID = $cdmi->ComputeNewAnnotationID($fid, $timeStamp);

Return a valid annotation ID for the specified feature and time stamp. The ID is formed from the feature ID and a complemented version of the time stamp, followed by a UUID. The complemented time stamp causes the annotations to sort in reverse chronological order, and the feature ID causes annotations for the same feature to cluster together. This provides for efficient retrieval, though the keys are gigantic.

fid

ID of the target feature for the annotation.

timeStamp

Time at which the annotation occurred.

RETURN

Returns a unique ID to give to the annotation.
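
The effect of complementing the time stamp can be seen in a small sketch. The exact key layout is not given in this document, so everything about the format here is an assumption: the ":" separator, the ten-digit width, and the 9999999999 ceiling are illustrative, sketch_annotation_id is a hypothetical helper, and the UUID is passed in as a plain string instead of being generated with Data::UUID.

```perl
use strict;
use warnings;

# ASSUMPTIONS: the ":" separator, the 10-digit width, and the 9999999999
# ceiling are illustrative only. Requires a Perl built with 64-bit integers.
sub sketch_annotation_id {
    my ($fid, $timeStamp, $uuid) = @_;
    my $complement = sprintf('%010d', 9999999999 - $timeStamp);
    return join(':', $fid, $complement, $uuid);
}

# Made-up feature ID, time stamps, and UUID placeholders.
my @ids = map { sketch_annotation_id('fid.1', $_->[0], $_->[1]) }
    ([1300000000, 'uuid-a'], [1400000000, 'uuid-b'], [1350000000, 'uuid-c']);

# A plain ascending string sort lists the newest annotation first.
my @sorted = sort @ids;
print "$_\n" for @sorted;
```

Because the feature ID leads the key, all annotations for one feature are adjacent in key order, and within that group the larger (later) time stamps produce the smaller complements.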

TuningParameter

    my $parm = $cdmi->TuningParameter($parmName);

Return the value of the specified tuning parameter. Tuning parameters are read from the XML configuration file.

parmName

Name of the parameter whose value is desired.

RETURN

Returns the parameter value.

ReadConfigFile

    my $xmlObject = $cdmi->ReadConfigFile();

Return the hash structure created from reading the configuration file, or an undefined value if the file is not found.

Virtual Methods

PreferredName

    my $name = $cdmi->PreferredName();

Return the variable name to use for this database when generating code.

LoadDirectory

    my $dirName = $cdmi->LoadDirectory();

Return the name of the directory in which load files are kept. The default is the FIG temporary directory, which is a really bad choice, but it's always there.

UseInternalDBD

    my $flag = $cdmi->UseInternalDBD();

Return TRUE if this database should be allowed to use an internal DBD. The internal DBD is stored in the _metadata table, which is created when the database is loaded. The Sapling uses an internal DBD.