The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

MassCalculator - Implements common mass computations in mass spectrometry

SYNOPSIS

  use InSilicoSpectro::InSilico::MassCalculator;
  InSilicoSpectro::InSilico::MassCalculator::init('insilicodef.xml');

DESCRIPTION

MassCalculator Perl library is intended to support common mass spectrometry-related computations that must be routinely performed by bioinformaticians dealing with mass spectrometry data.

To accommodate as many as possible user requirements, we decided to both support a classical procedural programming model and an object oriented model. Per se MassCalculator is not designed as an object oriented code but we provide a series of elementary classes for modeling proteins, peptides, enzymes, etc. which MassCalculator is compatible with. Moreover, the latter classes are rather simple and neutral in their design such that they should fit, after further derivations eventually, a large range of code design.

We decided not to use Perl object oriented programming to stay with relatively naive and simple code and to allow everybody to decide how to include it in its own project. Nonetheless, MassCalculator is able to deal with Perl objects we provide additionally to represent protein sequences, peptides, enzymes, mass lists, and fragmentation spectra, see their respective documentations.

MassCalculator is released under the LGPL license (see source code).

Configuration file and init function

One XML configuration file is necessary for MassCalculator.pm to work. It contains definitions of chemical elements, amino acids, fragmentation types, enzymes, etc.

These configuration files are read by the function init, which must be called before the other functions are used. It is possible to provide function init with another, or several others, file name(s).

init([$filename])

Opens the given file, or file ${phenyx.config.cleavenzymes} if no parameter was given, and stores all the modif in the dictionary.

massFromComposition($formula)

Computes monoisotopic and average masses from an atomic composition given in $formula. Monoisotopic and average masses are returned as a vector of 2 elements. Negative numbers are possible to accommodate with losses or certain modifications.

Example:

  print join(',', massFromComposition('HPO3')), "\n";

Basic mass functions

getMass($el, $mt)

Returns the mass, either monoisotopic or average mass depending on the current setting or the parameter $mt, of elements, amino acids, modifications and molecules.

$el

Contains the name of the "object" whose mass is to return. $el must start with el_ for elements (el_Na, el_H), with aa_ for amino acids (aa_Y, aa_T), with mol_ for molecules (mol_H2O), and mod_ for modifications (mod_Oxidation).

$mt

Contains the mass type (0=monoisotopic, 1=average). When this argument is not provided (most common function use), the current setting of mass type is used.

setMass($symbol, $formula)

This function allows the user to set new molecule, element, amino acid, and modification masses dynamically.

$symbol

Contains the identifier of the new object with the appropriate prefix (el_, aa_, mol_, or mod_).

$formula

Contains the atomic composition (it is used for computing monoisotopic and average masses). Negative numbers of atoms are possible to deal with modifications.

Examples:

setMass('mol_NH3', 'NH3');

setMass('mol_H2O', 'H2O');

setMassType($type)

Sets the current mass type. $type must be equal to 0 (monoisotopic) or 1 (average).

getMassType

Returns the current mass type.

setModif($mr)

add a InSilico::ModRes object

Peptide-related functions

This section groups all the functions that are directly related to peptides, i.e. peptide mass computation and protein digestion.

Modifications

To properly deal with modified proteins (and hence peptides) and compute their mass and MS/MS spectra, in case of peptides, we introduce a convention that allows to localize modifications in protein/peptide sequences.

A protein/peptide sequence is a sequence of amino acids

  a_1 a_2 a_3 a_4 ... a_n.

The corresponding modifications are represented either as a string or as a vector.

String representation

The string takes the form

  m_0:m_1:m_2:m_3:m_4: ... :m_n:m_(n+1),

where m_0 is the N-terminal site modification, m_i is a_i modification, and m_(n+1) is the C-terminal site modification. For instance, a peptide

  DEMSCGHTK

might be modified according to

  ACET_nterm:::Oxidation::Cys_CAM::Oxidation:::

which means that there is an N-terminal acetylation, the methionine and the histidine are oxidized, and the cysteine is carboxyamidomethylated; no C-terminal modification. We see that in this notation empty positions between colons are possible to denote the absence of modification. The modification identifiers come from the configuration file.

In the above string notation it is possible to define variable modifications, see function variablePeptide.

Vector representation

Alternatively, modifications can be localized by using a vector of strings. The length of the vector is len(peptide)+2 and element at index 0 corresponds to the N-terminus, index len(peptide)+1 to the C-terminus and the indices between 1 and len(peptide) correspond to the amino acids. The strings of this vector follow the same rule as the strings between ':' in the modification string, i.e. the contain the name of the modification or nothing, or they define variable modifications, see function variablePeptide.

The PMF case

In PMF only the peptide masses matter and it is not necessary to know the location of the modifications. We only need to know their numbers. Hence, when dealing with PMF computations we introduce a third convention for modification description. We use a vector that contains the number of occurrences and the modifications alternatively:

  num1, modif1, num2, modif2, ...

modifToString($modif, [$len])

In order to display modification strings/vectors conveniently, we provide a unique function modifToString that accepts all three formats and display the modifications in $modif as a string using an appropriate style. If the parameter $len (peptide length) is also given, then modifToString complements the length if the returned string if necessary (MS/MS only).

getPeptideMass(%h)

This function computes the (uncharged) mass of a peptide. The parameters are passed as a reference to a hash to permit named parameters. The named parameters are:

pept

The peptide sequence. If it contains B, Z or X then getPeptideMass returns -1.

modif

A reference to a vector containing the names of the modifications of the peptide. If modif is not specified, the peptide is assumed unmodified. Empty names or undefined names are considered as no modification.

termGain

Mass delta to add to the amino acid masses in case the usual plus water rule does not apply, e.g. when using trypsin to introduce two 18O molecules at the C-terminus.

Example:

  my $peptide = 'QCTIPADFK';
  my @modif = ('Cys_CAM');
  print "$peptide mass is ", getPeptideMass(pept=>$peptide, modif=>\@modif), "\n";

getCorrectCharge($mTheo, $mExp, $delta, $maxCharge)

Returns the appropriate charge and theoretical mass over charge ratio (m/z) for a given experimental m/z ratio and theoretical mass (charge not known). The result is a vector (charge, m/z) and charges between 1 and $maxCharge are tested.

$mTheo

Theoretical mass.

$mExp

Experimental m/z ratio.

$delta

Mass tolerance for comparing $mTheo and $mExp multiplied by the charge.

$maxCharge

Maximum charge to test for (minimum is 1).

digestByRegExp(%h)

Generic digestion function where the enzyme is specified as a Perl regular expression or a InSilicoSpectro::InSilico::CleavEnzyme object. The maximum number of missed cleavages per peptide can be set arbitrarily and the result of the digestion is either (1) a data structure that lists all the peptides with their masses, modifications, and start/stop positions; or (2) a vector of InSilicoSpectro::InSilico::Peptide objects. The choice depends on the type of the parameter protein, see hereafter. A mass window can be specified.

The named parameters are:

protein

The protein sequence. This sequence can be provided as a string or as an object of class InSilicoSpectro::InSilico::AASequence. In the latter case, if no modif parameter is given, then the modifications are taken from the AASequence object, otherwise the parameter modif overrides the modifications stored in the object.

If protein is given as an object then the result of the digestion will be a vector of InSilicoSpectro::InSilico::Peptide objects instead of the data structure.

modif

The parameter giving the localized modifications. If this parameter is not provided, then the protein is assumed to have no localized modification. Localized variable modifications are possible, see function variablePeptide. In case there are localized variable modifications, all the combinations of modifications will be put in the final result, i.e. the peptide is present several times with different modification strings.

Localized modifications are specified according to the two possible format explained above.

fixedModif

A reference to a vector of fixed modification names. The rules for identifying possible amino acids where modifications occur are read from the configuration file. All possible modification sites for all fixed modifications are computed and these sites are combined with the fixed modifications given by the parameter modif.

Fixed modification always have the priority over variable modifications and it is not allowed to have two fixed modifications at the same location.

varModif

A reference to a vector of variable modification names. If this parameter is provided then each generated peptide will be tested for possible variable modifications and all the combinations will be put in the final result, i.e. the peptide is present several times with different modification strings.

pmf

If this parameter is set to any value, then the exact modification location in the result is replaced by the format for PMF, see above.

nmc

Maximum number of missed cleavages, default 0.

minMass

Minimum peptide mass, default 500 Da.

maxMass

Maximum peptide mass, default 999999 Da.

enzyme

The enzyme can be specified either by a regular expression or by a CleavEnzyme object. If this parameter is not provided then the digestion is performed for trypsin according to the regular expression

  (?<=[KR])(?=[^P])

Regular expressions used for specifying an enzyme can be given as strings or precompiled regular expression such as

  $dibasicRegex = qr/(?<=[KR])(?=[KR])/,

which is exported by this module.

When you specify the enzyme by a CleavEnzyme object, the regular expression is read from the object.

CTermGain, NTermGain

In case the enzyme does not add H at N-termini and OH at C-termini, you can define new formulas for what is added.

When the enzyme comes as a CleavEnzyme object, these values are read from the enzyme object directly.

CTermModif, NTermModif

In case the enzyme does introduce a modification at peptide C-/N-termini, you can provide the name of the modification to apply.

When the enzyme comes as a CleavEnzyme object, these values are read from the enzyme object directly.

methionine

If methionine is set to any value then, in case the first protein amino acid is a methionine, supplementary peptides with initial methionine removed are generated. This may be useful when processing RNA-translated sequences.

addProton

This parameter, if set to any value, tells the function to add the mass of one proton to each peptide mass. This is useful for PMF as it allows direct comparison of theoretical (charged) masses with experimental masses.

duplicate

When set to any value, and the protein is given as a AASequence object, this parameter tells the function not to try to save memory when creating the Peptide object.

The result of the digestion is returned either as a vector of InSilicoSpectro::InSilico::Peptide objects or as a data structure which is a vector of six references:

  result[0] -> vector of peptide sequences
  result[1] -> vector of start positions in the original protein string
  result[2] -> vector of end positions in the original protein string
  result[3] -> vector of number of missed cleavages
  result[4] -> vector of peptide masses
  result[5] -> vector of modification descriptions

Variables $digestIndexPept=0, $digestIndexStart=1, $digestIndexEnd=2, $digestIndexNmc=3, $digestIndexMass=4, and $digestIndexModif=5 are exported for convenience.

Example:

  my $protein = 'MCTMACTKGIPRKQWWEMMKPCKADFCV';
  my $modif =   '::Cys_CAM::Oxidation::::::::::::::Oxidation:::::::::::';
  my @result = digestByRegExp(protein=>$protein, modif=>$modif, methionine=>1, nmc=>2);
  print "$protein\n$modif\n";
  for (my $i = 0; $i < @{$result[0]}; $i++){
    print "$result[$digestIndexPept][$i]\t$result[$digestIndexStart][$i]\t",
          "$result[$digestIndexEnd][$i]\t$result[$digestIndexNmc][$i]\t",
          "$result[$digestIndexMass][$i]\t", modifToString($result[$digestIndexModif][$i]), "\n";
  }

More examples in InSilicoSpectro/InSilico/test/testCalcDigest.pl and InSilicoSpectro/InSilico/test/testCalcDigestOOP.pl

nonSpecificDigestion(%h)

Generic digestion function where either no enzyme is used or half enzymatic peptides are computed. The result of the digestion is either (1) a data structure that lists all the peptides with their masses, modifications, and start/stop positions (see digestByRegExp above for a description); or (2) a vector of InSilicoSpectro::InSilico::Peptide objects. The choice depends on the type of the parameter protein, see hereafter.

The number of missed cleavage reported in the data structure is used to distinguish between the three possible cases: -3 for no enzyme peptide, -1 for half-tryptic peptides at their N-term end, and -2 for half-tryptic peptides at their C-term end.

The named parameters are:

protein

See function digestByRegExp.

modif

See function digestByRegExp.

fixedModif

See function digestByRegExp.

varModif

See function digestByRegExp.

pmf

See function digestByRegExp.

minMass

Minimum peptide mass.

maxMass

Maximum peptide mass.

minLen

Minimum peptide length in amino acids.

maxLen

Maximum peptide length in amino acids.

enzyme

If this parameter is not provided then the digestion is performed without any enzyme, i.e. the protein is cut after every amino acid and only the mass window and the peptide length window are applied as criteria to filter possible peptide.

If an enzyme regular expression or a CleavEnzyme object (see digestByRegExp above) is provided, then half-enzymatic peptides are generated, i.e. peptides with either the N- or the C-term end that corresponds to a cleavage site. The mass window and peptide length window also apply in this case.

CTermGain, NTermGain

See function digestByRegExp.

CTermModif, NTermModif

See function digestByRegExp.

addProton

See function digestByRegExp.

duplicate

See function digestByRegExp.

Example:

  my $protein = 'MCTMACTKGIPRKQWWEMMKPCKADFCV';
  my $modif =   '::Cys_CAM::Oxidation::::::::::::::Oxidation:::::::::::';
  my @result = nonSpecificDigestion(protein=>$protein, modif=>$modif);
  print "$protein\n$modif\n";
  for (my $i = 0; $i < @{$result[0]}; $i++){
    print "$result[$digestIndexPept][$i]\t$result[$digestIndexStart][$i]\t",
          "$result[$digestIndexEnd][$i]\t$result[$digestIndexNmc][$i]\t",
          "$result[$digestIndexMass][$i]\t", modifToString($result[$digestIndexModif][$i]), "\n";
  }

More examples in InSilicoSpectro/InSilico/test/testCalcDigest.pl and InSilicoSpectro/InSilico/test/testCalcDigestOOP.pl

matchPMF(%h)

This function compares an experimental PMF spectrum with a theoretical PMF spectrum, i.e. the result of a digestion function, and associates each theoretical mass with one experimental mass.

The named parameters are:

digestResult

A reference to a vector containing the result of the protein digestion as computed by either digestByRegExp or nonSpecificDigestion functions.

expSpectrum

The experimental spectrum is given either as a data structure or as an object of class InSilicoSpectro::Spectra::ExpSpectrum.

The data structure is a vector of references to vectors corresponding to experimental peaks in the experimental spectrum. The parameter expSpectrum is a reference to the experimental spectrum. Namely, expSpectrum has a structure like

  expSpectrum->[0] -> (mass, intensity, s/n, ...) for peak 1
  expSpectrum->[1] -> (mass, intensity, s/n, ...) for peak 2
  expSpectrum->[2] -> (mass, intensity, s/n, ...) for peak 3
  ...

By default, for matchPMF, it is assumed that the mass has index 0 and intensity has index 1 in the peak vectors. The other data in these vectors are not used but they can be retrieved after the spectrum match for computing statistics for instance. If mass index is not 0 or intensity index is not 1, use the parameters massIndex and intensityIndex.

In case expSpectrum is a InSilicoSpectro::Spectra::ExpSpectrum object, the experimental spectrum as well as mass and intensity indices are retrieved from the object directly.

If the peptide mass fingerprinting spectrum has been acquired on an ESI instrument, where peptides can be multiply charged, it is the user responsibility to replace multiply charged peptides m/z values by their singly charged equivalent. The match algorithm ignores peptide charges.

tol

Relative mass error tolerance; this parameter is optional. When not specified, the closest experimental peak is returned for each theoretical mass and any tolerance can be applied afterwards, i.e. outside of the function matchPMF. When specified, the most intense peak found with mass error satisfying

  relative error <= tol OR absolute error <= minTol

is returned; if no peak is found then the value in the returned vector remains undefined.

minTol

Absolute mass error, default value 0.1 Da. This parameter is used only in case tol parameter is specified, see above.

sorted

The experimental spectrum must sorted with respect to the mass for matchPMFClosest to work. In case it is already sorted when the function is called, sorted can be set to any value to avoid an unnecessary sort operation.

massIndex

Mass index in the experimental peak vectors, default 0.

intensityIndex

Intensity index in the experimental peak vectors, default 1.

The result of the matching is a vector of the same length as digestResult that contains references to the closest peaks vector in expSpectrum. The reference to the closest experimental peak for the peptide described at index i in digestResult is found at index i in the returned vector.

Examples in InSilicoSpectro/InSilico/test/testCalcPMFMatch.pl and InSilicoSpectro/InSilico/test/testCalcPMFMatchOOP.pl.

variablePeptide(%h)

Given a peptide sequence, fixed and localized variable modifications, and lists of variable and fixed modifications, this function returns a list comprising all possible modification combinations. Each amino acid can only receive one modification at a time and if several are possible for a given amino acid they will yield several distinct modification combinations. In case you need an amino acid to be modified simultaneously by several modifications, you can define this modifications combination as a new modification.

The peptide can given by a string, a InSilicoSpectro::InSilico::Peptide object or a InSilicoSpectro::InSilico::AASequence object.

The parameters are:

pept

The peptide.

modif

A reference to a vector (no string possible here) containing fixed and localized variable modifications for the peptide. If this parameter is not provided, then no such modifications are considered.

Fixed modification are always present whereas localized variable modifications are associated with a specific amino acid but may be present or not. Add '(*)' before a modification name to specify a localized variable modification. It is even possible to give several alternative variable modifications for the same amino acid: add '(*)' before the name of the first one and add the other modifications names separated by comas:

  (*)mod1,mod2,mod3

If the peptide is provided as a InSilicoSpectro::InSilico::Peptide object or a InSilicoSpectro::InSilico::AASequence object and the parameter modif is not given, then the localized modifications are taken from the object.

fixedModif

See function digestByRegExp.

varModif

See function digestByRegExp.

pmf

See function digestByRegExp. The total mass shift corresponding to a modification combination follows it in the returned vector.

Examples in InSilicoSpectro/InSilico/test/testCalcVarpept.pl and InSilicoSpectro/InSilico/test/testCalcVarpeptOOP.pl.

locateModif($peptSeq, $modif, $fixedModif, $varModif, $modifVect)

This function is used to generate a vector of modifications on the basis of localized fixed and variable modifications as well as a list of fixed and variable modifications, which are located by using their rule. The result is returned in $modifVect.

The parameters are:

$peptSeq

The peptide sequence.

$modif

The localized modifications, see function variablePeptide.

$fixedModif

A reference to a vector containing the name of fixed modifications.

$varModif

A reference to a vector containing the name of variable modifications.

$modifVect

A reference to a vector that will contain the result.

Example: see InSilicoSpectro/InSilico/test/testCalcDigest.pl and the mini web site.

Fragment mass functions

Fragment types are read from the configuration file. Supplementary fragments can be added subsequently by using the function setLoss, setSeries, and setFragType. Each fragment type is described by three main properties:

Ion series

A fragment ion series is the basic properties required for computing theoretical masses. Classical ion series are a,b,c and x,y,z. They define the generic position where the peptide is fragmented.

Certain very short fragments, such as b1, or very long fragments, such as bn, are normally not produced or detected. Hence an ion series also defines the first and last possible fragments.

Charge state

A fragment charge state can be 1,2,3,...

Neutral losses

Certain amino acids have the potential to loose certain molecules during the fragmentation process. To precisely define a fragment type one must specify if loss(es) is(are) possible and, if yes, which one(s) and with which multiplicity.

For instance, a fragment type b-H2O can be defined that is able to loose water once. Another example is b-H2O* that is allowed to loose water as many times as possible, depending on the peptide sequences (the maximum number of loss is limited by the number of amino acids able to loose water in the peptide sequence).

It is even possible to combine losses: b-NH3*-H2O* is a fragment type that allows for multiple water and ammonia losses.

getFragmentMasses(%h)

This function computes the theoretical fragmentation spectrum (MS/MS spectrum) of a peptide. Namely, the peptide can given by a string, a InSilicoSpectro::InSilico::Peptide object or a InSilicoSpectro::InSilico::AASequence object Top-down proteomics).

The named parameters are:

pept

The peptide.

modif

The peptide localized modifications. Can be a string or a reference to a vector, see function digestByRegExp. When not provided the peptide is assumed unmodified.

Only fixed (and localized) modifications make sense in the context of fragment mass computations.

If the peptide is provided as a InSilicoSpectro::InSilico::Peptide object or a InSilicoSpectro::InSilico::AASequence object and the parameter modif is not given, then the localized modifications are taken from the object.

fragTypes

A reference to a vector that gives the list of fragment types to use for computing the theoretical spectrum.

spectrum

A reference to a hash that is used for storing the theoretical spectrum. This hash is emptied before computation to remove a possible previous spectrum.

The data structure containing the theoretical spectrum is as follows.

  $spectrum{peptide}     = peptide sequence or Peptide object (depends on the given parameter)
  $spectrum{modif}       = reference to the vector of modifications actually used for the computation
  $spectrum{peptideMass} = peptide mass
  $spectrum{mass}        = theoretical spectrum
  $spectrum{ionType}     = ion types

$spectrum{mass} is a reference to another hash whose keys are either term for N- or C-terminal fragments, or internal for internal fragments such as immonium ions.

$spectrum{mass}{term} is a reference to a hash whose keys are the computed fragment types. For instance, we assume that b,y,b-H2O* were computed. We have

  $spectrum{mass}{term}{b}        = theoretical b fragment masses
  $spectrum{mass}{term}{y++}      = theoretical y++ fragment masses
  $spectrum{mass}{term}{b-H2O*}   = theoretical b-H2O* fragment masses

In case of fragment types without loss the hash

  $spectrum{mass}{term}{fragment_type}

is a reference to a vector of length n, where n is the length of the peptide sequence. For a N-terminal fragment type (a,b,c), index i in this vector correspond to a fragment that includes the i-th amino acid as its last amino acid. For a C-terminal fragment type (x,y,z), index i in this vector corresponds to a fragment that includes the (n-i)-th amino acid as its last amino acid. unused positions (too short or too long fragments) are left undefined.

Losses

In case of fragment with loss, it is possible that more than n positions are required to store the theoretical spectrum. We note that by convention fragment types with loss (b-H2O, b-H2O*) must include one loss at least. That is they do not include the masses corresponding to the fragment type without loss (b).

Simultaneous losses at one amino acid are not permitted. Such multiple losses can be caused by defining a fragment type with several possible losses and at least two of these losses are possible on the same amino acid. In case you really need multiple losses, you can define a new fragment type that only happens at the common amino acid and induces the total mass shift.

If we find no possible loss in a given peptide sequence for a given fragment type, then the reference

  $spectrum{mass}{term}{fragment_type}

remains undefined. If only one position for potential loss exists or the fragment type limits the number of loss to one (b-NH3), then the length of the theoretical spectrum will be n as for fragments without loss. In case several losses are possible and several positions for loss are found, the the length of the vector referen- ced by $spectrum{mass}{term}{fragment_type} is a multiple of n. For instance for two losses we would have the masses with one loss in the cells 0...n-1 and the masses with two losses in the cells n...2n-1. Recall that impossible fragments are represented by undefined cells.

In order to display theoretical spectra it is useful to be able to keep track of the exact number of losses, especially when several types of losses are possible simultaneously. Thus we store in

  $spectrum{ionType}

strings that describe the exact type of ions at hand. For instance b++-2(H2O)-NH3 to describe a doubly charged b fragment with two water losses and one ammonia loss (its fragment type b++-H2O*-NH3* does not provide this information). The ion types are listed by fragment types as are the theoretical masses

  $spectrum{ionType}{b}
  $spectrum{ionType}{y++}
  $spectrum{ionType}{b-H2O*}

For fragment types without loss we simply copy the fragment type

  $spectrum{ionType}{b}[0]   = 'b'
  $spectrum{ionType}{y++}[0] = 'y++'

whereas for fragment types with loss, we compute the appropriate strings and store them as a vector referenced by

  $spectrum{ionType}{fragment_type}.

Immonium ions

Immonium ions are internal ions that result from a y_m/a_n cleavages. By specifying 'immo' in the list of ion types, the masses corresponding to the immonium ions of the amino acids present in the peptide are computed and stored in a hash

  $spectrum{mass}{intern}{immo}

whose keys are the one letter codes of the residues and the values are the masses.

The configuration file contains the information about which are the amino acids that yield such ions as well as the mass delta to apply to the amino acid mass. Moreover, modified cysteines as well as modified methionine and histidine are included in the immonium ions with adapted mass. For Lysine, both its "nominal" immonium ion mass (101) and after ammonia loss (84) are computed. Immonium ions for the N- and C-terminal amino acids are not computed.

Example

my %spectrum; $peptide = 'HCMSKPQMLR'; $modif = '::Cys_CAM::::::Oxidation:::'; getFragmentMasses(pept=>$peptide, modif=>$modif, fragTypes=>['b','a', 'b-NH3*','b-H2O*','b++','y','y-NH3*','y-H2O*','y++','immo'], spectrum=>\%spectrum); print "\nfragments ($spectrum{peptideMass}):\n"; my $len = length($peptide); foreach my $frag (keys(%{$spectrum{mass}{term}})){ for (my $i = 0; $i < @{$spectrum{ionType}{$frag}}; $i++){ print $spectrum{ionType}{$frag}[$i]; for (my $j = $i*$len; $j < ($i+1)*$len; $j++){ print "\t", $spectrum{mass}{term}{$frag}[$j]+0.0; } print "\n"; } } foreach my $frag (keys(%{$spectrum{mass}{intern}})){ print $spectrum{ionType}{$frag}[0]; foreach my $aa (keys(%{$spectrum{mass}{intern}{$frag}})){ print "\t$aa\t$spectrum{mass}{intern}{$frag}{$aa}"; } print "\n"; }

More examples in InSilicoSpectro/InSilico/test/testCalcFrag.pl and InSilicoSpectro/InSilico/test/testCalcFragOOP.pl.

matchSpectrumClosest(%h)

This function compares an experimental MS/MS spectrum with a theoretical MS/MS spectrum and associates each theoretical mass with its closest experimental mass. The application of any mass tolerance can be done afterwards.

The named parameters are:

pept

The peptide sequence or a Peptide object. When not provided the spectrum is considered to be already computed and ready for use in spectrum.

modif

The modification string. When not provided the peptide is assumed unmodified. If the peptide is provided as a Peptide object, then the modifications are read from the Peptide object if no modif parameter is set, otherwise it overrides the object data.

fragTypes

A reference to a vector that gives the list of fragment types to use for computing the theoretical spectrum.

spectrum

The theoretical spectrum.

expSpectrum

The experimental spectrum, a data structure or an object InSilicoSpectro::Spectra::ExpSpectrum, see method PMFMatch.

sorted

The experimental spectrum must sorted with respect to the mass for matchSpectrumClosest to work. In case it is already sorted when the function is called, sorted can be set to any value to avoid an unnecessary sort operation.

massIndex

The mass index in the experimental peak vectors, default 0.

The result of the matching is an addition to the spectrum data structure. A new structure parallel to spectrum->{mass} is created that is named

  spectrum->{match}

that references a hash analogous to spectrum->{mass} but with the theoretical masses replaced by references to the closest peaks vector in expSpectrum.

Examples in InSilicoSpectro/InSilico/test/testCalcMatch.pl and InSilicoSpectro/InSilico/test/testCalcMatchOOP.pl.

matchSpectrumGreedy(%h)

Alternative algorithm for matching experimental and theoretical spectra. The experimental spectrum must have the masses at index 0 and the intensities at index 1 in the peak vectors. The mass error tolerance is determined by the rule

  relative error <= tol OR absolute error <= minTol

Four named parameters are available in addition to the ones of matchSpectrumClosest:

tol

Relative mass tolerance (default 500 ppm).

minTol

Minimum absolute mass tolerance (default 0.2 Da).

order

A reference to a vector that gives the fragment types in an order of priority for matching.

intensityIndex

The intensity index in the experimental peak vectors, default 1.

If order is not set, then matchSpectrumGreedy returns the matching experimental peaks within mass tolerance tol with the highest intensities. That is when several experimental masses are possible, the one corresponding to the most intense peak is chosen, although it is not necessarily the closest.

If order is set, then a greedy algorithm is applied additionally. Namely, the fragment types are processed in the order given by this parameter and the most intense peaks are chosen as before but a chosen peak is no longer available for subsequent matches. This ensures that experimental masses are used once only which is not true for the case with no order and for matchSpectrumClosest.

Examples in InSilicoSpectro/InSilico/test/testCalcMatch.pl and InSilicoSpectro/InSilico/test/testCalcMatchOOP.pl.

getSeries($name)

Returns a vector (terminus, monoisotopic delta, average delta, first fragment, last fragment) that contains the parameters of series $name, where

$name

is th name of the series, e.g. b or y.

terminus

is equal to 'N' or 'C'.

monoisotopic delta

is the mass delta to apply when computing the monoisotopic mass. It is 0 for a b series and 1.007825 for y.

average delta

is the mass delta to apply when computing the average mass. It is 0 for a b series and 1.007976 for y.

first fragment

is the number of the first fragment to compute. For instance, fragment b1 is generally not detected hence first fragment should be set to 2 for a b series. The rule is the same for N- and C-term series.

last fragment

is the number of the last fragment counted from the end. For instance, the last b fragment is normally not detected hence last fragment should be set to 2. If a fragment containing all the amino acids of the peptide is possible, then it should be set to 1. The rule is the same for N- and C-term series.

setSeries($name, $terminus, $formula, $firstFrag, $lastFrag)

Adds a new series to the series read from the configuration file. The parameters are defined above in getSeries except formula. The mass deltas are not given as real numbers but rather as deltas in atoms. For instance, b series formula is empty because we do want 0 delta and c series formula is 'N 1 H 3'. Example:

  setSeries('c', 'N', 'N 1 H 3', 2, 2);

getLoss($name)

Returns a vector (residues, monoisotopic delta, average delta) that contains the parameters of a neutral loss, where

$name

is the name of the neutral loss.

residues

is a string containing the one character codes of the amino acids where the loss is possible.

monoisotopic delta

is the mass delta to apply when computing the monoisotopic mass.

average delta

is the mass delta to apply when computing the average mass.

setLoss($name, $residues, $formula) =head2 setLoss($name, $residues, $deltamass_mono, $deltamass_avg)

Adds a new loss to the ones read from the file fragments.xml. The parameters are defined above in getLoss except formula. The mass deltas are not given as real numbers but rather as deltas in atoms. Example:

  setLoss('H2O', 'ST', 'H 2 O 1');
or 
  setLoss('H2O', 'ST', 18.003, 17.927);

getFragType($name)

Returns a vector (series, charge, loss 1, repeat 1, loss 2, repeat 2, ...) containing the parameters of a fragment type, where

$name

is the name of the fragment type.

series

is the series on which this fragment type is based.

charge

is the charge state of a fragment of this type.

loss k

in case the fragment type includes losses, loss k is set to the name of each loss possible for this fragment type.

repeat k

the maximum number of each loss, -1 means no maximum.

getFragTypeList

Returns a vector containing the list of all fragment type names (in arbitrary order).

setFragType($name, $series, $charge, $loss, $repeat)

Adds a new fragment type to the ones read from the file fragments.xml. The parameters $name, $charge and $series are defined above in getFragType.

$loss

is the reference of a vector containing the names of the possible losses for this fragment type. In case no loss is possible then $loss may be let undefined or references an empty vector.

$repeat

gives the repetitions of the losses listed in vector $loss. If $loss is undefined or references an empty vector, then $repeat is ignored.

Example:

  setFragType('y++-H2O*-NH3(2)', 'y', 2, ['H2O', 'NH3'], [-1, 2]);

defines a fragment type made of doubly charged y fragments with a maximum of 2 ammonia losses and an arbitrary number of water losses.

setImmonium($residues, $delta)

Replaces the definition of immonium ion parameters read from the file fragments.xml.

$residues

is a string containing the one letter codes of the residues that yield detectable immonium ions.

$delta

is the mass delta to obtain the immonium ion mass from its amino acid mass.

getImmonium

Returns a vector (residues, delta); the parameters residues and delta are defined in setImmoniumn above.

getImmoniumMass($aa, $mod)

Computes the mass of an immonium ion given the amino acid one letter code $aa and a possible modification $mod. In case of Lysine (K), two values are returned.

EXAMPLES

See programs starting with testCalc in folder InSilicoSpectro/InSilico/test/.

AUTHORS

Jacques Colinge, Upper Austria University of Applied Science at Hagenberg