Efficient algorithm to generate an isotope distribution #1023

ctapobep · 2023-11-08T22:27:39Z

CDK's IsotopePatternGenerator was too slow for large molecules since it's a $\theta(n^2)$ algorithm. So we've implemented our own library elsci-io/isotope-distribution with a loglinear complexity. That lib is a CDK-aware wrapper around a domain-agnostic algorithm inside elsci-io/multinomial-selection.

Question is - are you guys interested in having this code incorporated in CDK? It's a different API in a sense that it doesn't return the whole list of possible isotopologues as IsotopePatternGenerator does. Rather it returns an iterator that descends from the most probable isotope down to the least probable one. And the client code can decide if it wants to stop before reaching the end.

So if you're interested in making it part of CDK, I don't think we can/should make it compatible with IsotopePatternGenerator - rather it could be a completely different set of classes. There's also a question of Java Modules (#622), so it makes sense to put new code into separate packages anyway. And keep IsotopePatternGenerator for backward compatibility.

The text was updated successfully, but these errors were encountered:

johnmay · 2023-11-09T09:03:21Z

Possibly and sounds interesting, is it similar to the MostAbundant functionality.

In this case we just care about generating the most-abundant mol weight rather than actually generating all the isotopes:

Core of the algorithm:
https://github.com/cdk/cdk/blob/main/base/standard/src/main/java/org/openscience/cdk/tools/manipulator/AtomContainerManipulator.java#L364C1-L384C6

Slide 12 hee: https://nextmovesoftware.com/talks/Sayle_SimpleCheminformaticsProblems_RDKITUGM_201809.pdf

ctapobep · 2023-11-09T17:01:50Z

Huh, didn't know it existed! It does look similar to how we start looking for the most abundant isotopologue. But CDK gives incorrect results in some cases. E.g. if you take $S19$, its 2 most abundant isotopologues are:

$^{32}S_{19}$ with the probability $.9493^{19} = 0.3721$
$^{32}S_{18} \text{} ^{34}S_1$ with the probability $.9493^{18}.0429 \frac{19!}{18!1!} = 0.3195$

And.. the algorithm in CDK returns the 2nd isotopologue as the most abundant one. The problem is this line. The idea behind that code is that if you have red balls (probability=0.9) and blue balls (probability=0.1), and if you sample 10 balls then we get a clean result with the combination 9r1b as the most probable. We can see that $0.1 \times 10 \geq 1$ which means we expect to get 1 blue ball. However! If the numbers aren't that pretty (e.g. r=0.95 and b=0.05), then we can't easily tell whether it's going to be 9r1b or 10r - it will depend on whether $10*.95=9.5$ is closer to 10 or $10*.05=0.5$ is closer to 1 (these quantities can't necessarily be compared directly, I didn't fully think it through).

That's why in our algorithm we go a little further - after calculating the initial guess (similar to how CDK does it) we modify the combination to see if the "neighboring" combination has higher probability. It's not mathematical, but it's a simple hack that works :) And we need that code (looking at "neighboring" isotopologues) anyway to further iterate to the less probable combinations.

johnmay · 2023-11-10T07:32:08Z

That sounds like a reasonable approach, and not difficult to make the required modifications. I still think the lazy generator would be useful, how much code is it?

johnmay · 2023-11-10T07:43:25Z

We should at least provide the option to get the Isotopic MF out of the distribution. There is a mirror of this code for the formula.

        IChemObjectBuilder bldr = SilentChemObjectBuilder.getInstance();
        IMolecularFormula mf =
                MolecularFormulaManipulator.getMolecularFormula("C202H315N55O64S6",
                                                                bldr);
        org.hamcrest.MatcherAssert.assertThat(MolecularFormulaManipulator.getMass(mf, MolWeight),
                          closeTo(4730.397, 0.001));
        org.hamcrest.MatcherAssert.assertThat(MolecularFormulaManipulator.getMass(mf, MolWeightIgnoreSpecified),
                          closeTo(4730.397, 0.001));
        org.hamcrest.MatcherAssert.assertThat(MolecularFormulaManipulator.getMass(mf, MonoIsotopic),
                          closeTo(4727.140, 0.001));
        org.hamcrest.MatcherAssert.assertThat(MolecularFormulaManipulator.getMass(mf, MostAbundant),
                          closeTo(4729.147, 0.001));

johnmay · 2023-11-10T07:46:05Z

NVM I forgot there is away to get the formula out:

IMolecularFormula mf = new MolecularFormula();
mf.addIsotope(new Atom("C"), 6);
mf.addIsotope(new Atom("Br"), 6);
IMolecularFormula mamf = MolecularFormulaManipulator.getMostAbundant(mf);
assertThat(MolecularFormulaManipulator.getString(mamf, false, true),
           is("[12]C6[79]Br3[81]Br3"));

Confirming your observation:

IMolecularFormula mf = new MolecularFormula();
mf.addIsotope(new Atom("S"), 19);
IMolecularFormula mamf = MolecularFormulaManipulator.getMostAbundant(mf);
System.out.println(MolecularFormulaManipulator.getString(mamf, false, true)); // [32]S18[34]S

johnmay · 2023-11-10T12:50:15Z

Scratching our in heads in the office I think you second probability is slightly off. In your second case 0.9493 + 0.0429 does not add to 1. If you take 1-.9493 = 0.0507 then the second is now more probable. If there was only two isotopes then that would be correct but you have one other atom which must be 33, 34, or 36S and is likely to be more common.

$^{32}S_{19}$ with the probability $.9493^{19} = 0.3721$
$^{32}S_{18} \text{} ^{34}S_1$ with the probability $.9493^{18}(1-.9493) \frac{19!}{18!1!} = 0.377$

@rogersayle

johnmay · 2023-11-10T12:53:15Z

It's easier to work with: https://github.com/cdk/cdk/blob/main/tool/formula/src/main/java/org/openscience/cdk/tools/manipulator/MolecularFormulaManipulator.java#L1598C28-L1622.

If we modified the algorithm as though there were only two isotopes you get your answer:

    private static boolean addIsotopeDist(IMolecularFormula mf,
                                          IIsotope[] isotopes,
                                          int idx, int count) {
        if (count == 0)
            return true;
        double frac = 100d;
        for (int i = 0; i < Math.max(idx, 2); i++) // if there were two isotopes
//        for (int i = 0; i < idx; i++)
            frac -= isotopes[i].getNaturalAbundance();
        double p = isotopes[idx].getNaturalAbundance() / frac;

        if (p >= 1.0) {
            mf.addIsotope(isotopes[idx], count);
            return true;
        }

        double kMin = (count + 1) * (1 - p) - 1;
        double kMax = (count + 1) * (1 - p);
        if ((int) Math.ceil(kMin) == (int) Math.floor(kMax)) {
            int k = (int) kMax;
            mf.addIsotope(isotopes[idx], count - k);
            // recurse with remaining
            return addIsotopeDist(mf, isotopes, idx + 1, k);
        }
        return false; // multiple are most abundant
    }

ctapobep · 2023-11-10T13:17:08Z

I still think the lazy generator would be useful, how much code is it?

So there are 2 different libs - elsci-io/multinomial-selection contains the actual math & algorithm. Since it's just combinatorics & probabilities w/o any chemical context, we decided to separate it out. Who knows maybe we'll need it for something other than isotopes. So we could create a module in CDK that depends on that lib, and basically copies the code from elsci-io/isotope-distribution - which contains 5 small classes.

Scratching our in heads in the office I think you second probability is slightly off. In your second case 0.9493 + 0.429 does not add to 1.

Yep, $S$ has 24 isotopes in CDK. The first 2 contribute to $0.9493 + 0.0429 = 0.9922$ (notice it's not $0.429$). Try using $0.0429$, because $1-0.9493$ is the probability of all other isotopes combined except for the 1st one. We could of course use all 24 isotopes in the formula, but they don't matter because there are 0 elements of those - so both $p^0$ and $0!$ end up to be 1 and so they won't have any impact on the result.

johnmay · 2023-11-10T14:47:15Z

Yep, $S$ has 24 isotopes in CDK. The first 2 contribute to $0.9493 + 0.0429 = 0.9922$ (notice it's not $0.429$). Try using $0.0429$, because $1-0.9493$ is the probability of all other isotopes combined except for the 1st one. We could of course use all 24 isotopes in the formula, but they don't matter because there are 0 elements of those - so both $p^0$ and $0!$ end up to be 1 and so they won't have any impact on the result.

If only we could create S₁₉ an measure the peak :-).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Efficient algorithm to generate an isotope distribution #1023

Efficient algorithm to generate an isotope distribution #1023

ctapobep commented Nov 8, 2023 •

edited

johnmay commented Nov 9, 2023

ctapobep commented Nov 9, 2023 •

edited

johnmay commented Nov 10, 2023

johnmay commented Nov 10, 2023

johnmay commented Nov 10, 2023 •

edited

johnmay commented Nov 10, 2023 •

edited

johnmay commented Nov 10, 2023

ctapobep commented Nov 10, 2023 •

edited

johnmay commented Nov 10, 2023

Efficient algorithm to generate an isotope distribution #1023

Efficient algorithm to generate an isotope distribution #1023

Comments

ctapobep commented Nov 8, 2023 • edited

johnmay commented Nov 9, 2023

ctapobep commented Nov 9, 2023 • edited

johnmay commented Nov 10, 2023

johnmay commented Nov 10, 2023

johnmay commented Nov 10, 2023 • edited

johnmay commented Nov 10, 2023 • edited

johnmay commented Nov 10, 2023

ctapobep commented Nov 10, 2023 • edited

johnmay commented Nov 10, 2023

ctapobep commented Nov 8, 2023 •

edited

ctapobep commented Nov 9, 2023 •

edited

johnmay commented Nov 10, 2023 •

edited

johnmay commented Nov 10, 2023 •

edited

ctapobep commented Nov 10, 2023 •

edited