Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficient algorithm to generate an isotope distribution #1023

Open
ctapobep opened this issue Nov 8, 2023 · 9 comments
Open

Efficient algorithm to generate an isotope distribution #1023

ctapobep opened this issue Nov 8, 2023 · 9 comments

Comments

@ctapobep
Copy link

ctapobep commented Nov 8, 2023

CDK's IsotopePatternGenerator was too slow for large molecules since it's a $\theta(n^2)$ algorithm. So we've implemented our own library elsci-io/isotope-distribution with a loglinear complexity. That lib is a CDK-aware wrapper around a domain-agnostic algorithm inside elsci-io/multinomial-selection.

Question is - are you guys interested in having this code incorporated in CDK? It's a different API in a sense that it doesn't return the whole list of possible isotopologues as IsotopePatternGenerator does. Rather it returns an iterator that descends from the most probable isotope down to the least probable one. And the client code can decide if it wants to stop before reaching the end.

So if you're interested in making it part of CDK, I don't think we can/should make it compatible with IsotopePatternGenerator - rather it could be a completely different set of classes. There's also a question of Java Modules (#622), so it makes sense to put new code into separate packages anyway. And keep IsotopePatternGenerator for backward compatibility.

@johnmay
Copy link
Member

johnmay commented Nov 9, 2023

Possibly and sounds interesting, is it similar to the MostAbundant functionality.

In this case we just care about generating the most-abundant mol weight rather than actually generating all the isotopes:

Core of the algorithm:
https://github.com/cdk/cdk/blob/main/base/standard/src/main/java/org/openscience/cdk/tools/manipulator/AtomContainerManipulator.java#L364C1-L384C6

Slide 12 hee: https://nextmovesoftware.com/talks/Sayle_SimpleCheminformaticsProblems_RDKITUGM_201809.pdf

@ctapobep
Copy link
Author

ctapobep commented Nov 9, 2023

Huh, didn't know it existed! It does look similar to how we start looking for the most abundant isotopologue. But CDK gives incorrect results in some cases. E.g. if you take $S19$, its 2 most abundant isotopologues are:

  • $^{32}S_{19}$ with the probability $.9493^{19} = 0.3721$
  • $^{32}S_{18} \text{} ^{34}S_1$ with the probability $.9493^{18}.0429 \frac{19!}{18!1!} = 0.3195$

And.. the algorithm in CDK returns the 2nd isotopologue as the most abundant one. The problem is this line. The idea behind that code is that if you have red balls (probability=0.9) and blue balls (probability=0.1), and if you sample 10 balls then we get a clean result with the combination 9r1b as the most probable. We can see that $0.1 \times 10 \geq 1$ which means we expect to get 1 blue ball. However! If the numbers aren't that pretty (e.g. r=0.95 and b=0.05), then we can't easily tell whether it's going to be 9r1b or 10r - it will depend on whether $10*.95=9.5$ is closer to 10 or $10*.05=0.5$ is closer to 1 (these quantities can't necessarily be compared directly, I didn't fully think it through).

That's why in our algorithm we go a little further - after calculating the initial guess (similar to how CDK does it) we modify the combination to see if the "neighboring" combination has higher probability. It's not mathematical, but it's a simple hack that works :) And we need that code (looking at "neighboring" isotopologues) anyway to further iterate to the less probable combinations.

@johnmay
Copy link
Member

johnmay commented Nov 10, 2023

That sounds like a reasonable approach, and not difficult to make the required modifications. I still think the lazy generator would be useful, how much code is it?

@johnmay
Copy link
Member

johnmay commented Nov 10, 2023

We should at least provide the option to get the Isotopic MF out of the distribution. There is a mirror of this code for the formula.

        IChemObjectBuilder bldr = SilentChemObjectBuilder.getInstance();
        IMolecularFormula mf =
                MolecularFormulaManipulator.getMolecularFormula("C202H315N55O64S6",
                                                                bldr);
        org.hamcrest.MatcherAssert.assertThat(MolecularFormulaManipulator.getMass(mf, MolWeight),
                          closeTo(4730.397, 0.001));
        org.hamcrest.MatcherAssert.assertThat(MolecularFormulaManipulator.getMass(mf, MolWeightIgnoreSpecified),
                          closeTo(4730.397, 0.001));
        org.hamcrest.MatcherAssert.assertThat(MolecularFormulaManipulator.getMass(mf, MonoIsotopic),
                          closeTo(4727.140, 0.001));
        org.hamcrest.MatcherAssert.assertThat(MolecularFormulaManipulator.getMass(mf, MostAbundant),
                          closeTo(4729.147, 0.001));

@johnmay
Copy link
Member

johnmay commented Nov 10, 2023

NVM I forgot there is away to get the formula out:

IMolecularFormula mf = new MolecularFormula();
mf.addIsotope(new Atom("C"), 6);
mf.addIsotope(new Atom("Br"), 6);
IMolecularFormula mamf = MolecularFormulaManipulator.getMostAbundant(mf);
assertThat(MolecularFormulaManipulator.getString(mamf, false, true),
           is("[12]C6[79]Br3[81]Br3"));

Confirming your observation:

IMolecularFormula mf = new MolecularFormula();
mf.addIsotope(new Atom("S"), 19);
IMolecularFormula mamf = MolecularFormulaManipulator.getMostAbundant(mf);
System.out.println(MolecularFormulaManipulator.getString(mamf, false, true)); // [32]S18[34]S

@johnmay
Copy link
Member

johnmay commented Nov 10, 2023

Scratching our in heads in the office I think you second probability is slightly off. In your second case 0.9493 + 0.0429 does not add to 1. If you take 1-.9493 = 0.0507 then the second is now more probable. If there was only two isotopes then that would be correct but you have one other atom which must be 33, 34, or 36S and is likely to be more common.

  • $^{32}S_{19}$ with the probability $.9493^{19} = 0.3721$
  • $^{32}S_{18} \text{} ^{34}S_1$ with the probability $.9493^{18}(1-.9493) \frac{19!}{18!1!} = 0.377$

@rogersayle

@johnmay
Copy link
Member

johnmay commented Nov 10, 2023

It's easier to work with: https://github.com/cdk/cdk/blob/main/tool/formula/src/main/java/org/openscience/cdk/tools/manipulator/MolecularFormulaManipulator.java#L1598C28-L1622.

If we modified the algorithm as though there were only two isotopes you get your answer:

    private static boolean addIsotopeDist(IMolecularFormula mf,
                                          IIsotope[] isotopes,
                                          int idx, int count) {
        if (count == 0)
            return true;
        double frac = 100d;
        for (int i = 0; i < Math.max(idx, 2); i++) // if there were two isotopes
//        for (int i = 0; i < idx; i++)
            frac -= isotopes[i].getNaturalAbundance();
        double p = isotopes[idx].getNaturalAbundance() / frac;

        if (p >= 1.0) {
            mf.addIsotope(isotopes[idx], count);
            return true;
        }

        double kMin = (count + 1) * (1 - p) - 1;
        double kMax = (count + 1) * (1 - p);
        if ((int) Math.ceil(kMin) == (int) Math.floor(kMax)) {
            int k = (int) kMax;
            mf.addIsotope(isotopes[idx], count - k);
            // recurse with remaining
            return addIsotopeDist(mf, isotopes, idx + 1, k);
        }
        return false; // multiple are most abundant
    }

@ctapobep
Copy link
Author

ctapobep commented Nov 10, 2023

I still think the lazy generator would be useful, how much code is it?

So there are 2 different libs - elsci-io/multinomial-selection contains the actual math & algorithm. Since it's just combinatorics & probabilities w/o any chemical context, we decided to separate it out. Who knows maybe we'll need it for something other than isotopes. So we could create a module in CDK that depends on that lib, and basically copies the code from elsci-io/isotope-distribution - which contains 5 small classes.

Scratching our in heads in the office I think you second probability is slightly off. In your second case 0.9493 + 0.429 does not add to 1.

Yep, $S$ has 24 isotopes in CDK. The first 2 contribute to $0.9493 + 0.0429 = 0.9922$ (notice it's not $0.429$). Try using $0.0429$, because $1-0.9493$ is the probability of all other isotopes combined except for the 1st one. We could of course use all 24 isotopes in the formula, but they don't matter because there are 0 elements of those - so both $p^0$ and $0!$ end up to be 1 and so they won't have any impact on the result.

@johnmay
Copy link
Member

johnmay commented Nov 10, 2023

Yep, $S$ has 24 isotopes in CDK. The first 2 contribute to $0.9493 + 0.0429 = 0.9922$ (notice it's not $0.429$). Try using $0.0429$, because $1-0.9493$ is the probability of all other isotopes combined except for the 1st one. We could of course use all 24 isotopes in the formula, but they don't matter because there are 0 elements of those - so both $p^0$ and $0!$ end up to be 1 and so they won't have any impact on the result.

If only we could create S19 an measure the peak :-).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants