New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Efficient algorithm to generate an isotope distribution #1023
Comments
Possibly and sounds interesting, is it similar to the MostAbundant functionality. In this case we just care about generating the most-abundant mol weight rather than actually generating all the isotopes: Core of the algorithm: Slide 12 hee: https://nextmovesoftware.com/talks/Sayle_SimpleCheminformaticsProblems_RDKITUGM_201809.pdf |
Huh, didn't know it existed! It does look similar to how we start looking for the most abundant isotopologue. But CDK gives incorrect results in some cases. E.g. if you take
And.. the algorithm in CDK returns the 2nd isotopologue as the most abundant one. The problem is this line. The idea behind that code is that if you have red balls (probability=0.9) and blue balls (probability=0.1), and if you sample 10 balls then we get a clean result with the combination 9r1b as the most probable. We can see that That's why in our algorithm we go a little further - after calculating the initial guess (similar to how CDK does it) we modify the combination to see if the "neighboring" combination has higher probability. It's not mathematical, but it's a simple hack that works :) And we need that code (looking at "neighboring" isotopologues) anyway to further iterate to the less probable combinations. |
That sounds like a reasonable approach, and not difficult to make the required modifications. I still think the lazy generator would be useful, how much code is it? |
We should at least provide the option to get the Isotopic MF out of the distribution. There is a mirror of this code for the formula. IChemObjectBuilder bldr = SilentChemObjectBuilder.getInstance();
IMolecularFormula mf =
MolecularFormulaManipulator.getMolecularFormula("C202H315N55O64S6",
bldr);
org.hamcrest.MatcherAssert.assertThat(MolecularFormulaManipulator.getMass(mf, MolWeight),
closeTo(4730.397, 0.001));
org.hamcrest.MatcherAssert.assertThat(MolecularFormulaManipulator.getMass(mf, MolWeightIgnoreSpecified),
closeTo(4730.397, 0.001));
org.hamcrest.MatcherAssert.assertThat(MolecularFormulaManipulator.getMass(mf, MonoIsotopic),
closeTo(4727.140, 0.001));
org.hamcrest.MatcherAssert.assertThat(MolecularFormulaManipulator.getMass(mf, MostAbundant),
closeTo(4729.147, 0.001)); |
NVM I forgot there is away to get the formula out: IMolecularFormula mf = new MolecularFormula();
mf.addIsotope(new Atom("C"), 6);
mf.addIsotope(new Atom("Br"), 6);
IMolecularFormula mamf = MolecularFormulaManipulator.getMostAbundant(mf);
assertThat(MolecularFormulaManipulator.getString(mamf, false, true),
is("[12]C6[79]Br3[81]Br3")); Confirming your observation: IMolecularFormula mf = new MolecularFormula();
mf.addIsotope(new Atom("S"), 19);
IMolecularFormula mamf = MolecularFormulaManipulator.getMostAbundant(mf);
System.out.println(MolecularFormulaManipulator.getString(mamf, false, true)); // [32]S18[34]S |
Scratching our in heads in the office I think you second probability is slightly off. In your second case 0.9493 + 0.0429 does not add to 1. If you take 1-.9493 = 0.0507 then the second is now more probable. If there was only two isotopes then that would be correct but you have one other atom which must be 33, 34, or 36S and is likely to be more common.
|
It's easier to work with: https://github.com/cdk/cdk/blob/main/tool/formula/src/main/java/org/openscience/cdk/tools/manipulator/MolecularFormulaManipulator.java#L1598C28-L1622. If we modified the algorithm as though there were only two isotopes you get your answer: private static boolean addIsotopeDist(IMolecularFormula mf,
IIsotope[] isotopes,
int idx, int count) {
if (count == 0)
return true;
double frac = 100d;
for (int i = 0; i < Math.max(idx, 2); i++) // if there were two isotopes
// for (int i = 0; i < idx; i++)
frac -= isotopes[i].getNaturalAbundance();
double p = isotopes[idx].getNaturalAbundance() / frac;
if (p >= 1.0) {
mf.addIsotope(isotopes[idx], count);
return true;
}
double kMin = (count + 1) * (1 - p) - 1;
double kMax = (count + 1) * (1 - p);
if ((int) Math.ceil(kMin) == (int) Math.floor(kMax)) {
int k = (int) kMax;
mf.addIsotope(isotopes[idx], count - k);
// recurse with remaining
return addIsotopeDist(mf, isotopes, idx + 1, k);
}
return false; // multiple are most abundant
} |
So there are 2 different libs - elsci-io/multinomial-selection contains the actual math & algorithm. Since it's just combinatorics & probabilities w/o any chemical context, we decided to separate it out. Who knows maybe we'll need it for something other than isotopes. So we could create a module in CDK that depends on that lib, and basically copies the code from elsci-io/isotope-distribution - which contains 5 small classes.
Yep, |
If only we could create S19 an measure the peak :-). |
CDK's$\theta(n^2)$ algorithm. So we've implemented our own library elsci-io/isotope-distribution with a loglinear complexity. That lib is a CDK-aware wrapper around a domain-agnostic algorithm inside elsci-io/multinomial-selection.
IsotopePatternGenerator
was too slow for large molecules since it's aQuestion is - are you guys interested in having this code incorporated in CDK? It's a different API in a sense that it doesn't return the whole list of possible isotopologues as
IsotopePatternGenerator
does. Rather it returns an iterator that descends from the most probable isotope down to the least probable one. And the client code can decide if it wants to stop before reaching the end.So if you're interested in making it part of CDK, I don't think we can/should make it compatible with
IsotopePatternGenerator
- rather it could be a completely different set of classes. There's also a question of Java Modules (#622), so it makes sense to put new code into separate packages anyway. And keepIsotopePatternGenerator
for backward compatibility.The text was updated successfully, but these errors were encountered: