Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integration of ErtlFunctionalGroupsFinder and Sugar Removal Utility into CDK #854

Open
JonasSchaub opened this issue Apr 11, 2022 · 7 comments
Assignees

Comments

@JonasSchaub
Copy link
Contributor

Dear CDK-developers,

I have been involved in two projects that might be interesting for integration into the CDK. I have spoken with Egon about them last week and I will give all necessary details for everyone below:

The first one is the ErtlFunctionalGroupsFinder, an open, CDK-based implementation of Peter Ertl's algorithm for functional group detection. Our implementation is described in this paper and the source code can be found here.
Getting it into CDK would require integrating one principal class implementing the functionality, a corresponding test class (these two can be found here), and (if you agree) a utility class with some convenient routines for pre-and postprocessing in a workflow using ErtlFunctionalGroupsFinder. The utility class along with its test class can be found here.

The second project is named Sugar Removal Utility, an open, CDK-based implementation of an algorithm for in silico deglycosylation. We have described the algorithm and its implementation in this paper and used it for the project described here that illustrates its possibilities and configurations in more detail. The source code can be found here.
As far as we know, there is no deglycosylation functionality in CDK, only a CDK-based KNIME node for this purpose.
The Sugar Removal Utility consists of one principal class implementing the functionality and a corresponding test class. In the latter, a few test methods would have to be removed that illustrate the deglycosylation results on a small test data set and test the surrounding command-line application.

Do you think these two projects should be integrated into the CDK? Should I create two pull request where we can work on the details?
One important question I have would be into which module to put my classes. Ideally, there would be one that fits thematically and supports the modules I need for my code. Otherwise, we could maybe also add one or two new modules.

The Sugar Removal Utility needs the following modules:

  • cdk-interfaces
  • cdk-standard
  • cdk-core
  • cdk-isomorphism
  • cdk-smarts
  • cdk-smiles
  • cdk-valencycheck
  • cdk-data (test scope)

ErtlFunctionalGroupsFinder requires these modules:

  • cdk-core
  • cdk-interfaces
  • cdk-hash
  • cdk-smiles
  • cdk-standard
  • cdk-isomorphism (test scope)
  • cdk-data (test scope)

I would be happy to give more details on specific aspects of the two projects or discuss any required code changes.

Kind regards,
Jonas Schaub

@egonw
Copy link
Member

egonw commented Jun 6, 2022

hi @JonasSchaub, I'm sorry I dropped the ball. In my defense, I still need to write up a blog post about the CDK meeting.

@johnmay, I think both utility are useful and should be accepted as features in the CDK. Do you agree? If so, let's set up that integration pull request to work out the details.

@JonasSchaub
Copy link
Contributor Author

Hi @egonw

no worries, I am also looking forward to that blog post about our meeting!

On the subject at hand: I guess it would be best if I simply create one or two separate new modules for the two functionalities to freely manage the dependencies. Would you agree? If yes, would you have name suggestions/preferences?

And what should be the target branch of the pull request?

But I also have to note that I probably will not have much time to work on this in the coming weeks, unfortunately.

Kind regards,
Jonas

@johnmay
Copy link
Member

johnmay commented Jun 7, 2022

Sounds good, hard to decide where they would go. FuncGroups is sort a descriptor and SugarRemoval is sort of a standardisation (which we don't have a module for).

@JonasSchaub
Copy link
Contributor Author

Hi @johnmay

I would propose to create a "standardisation" module, then. We are planning to implement further standardisation routines with/for CDK anyway in our group.

@JonasSchaub
Copy link
Contributor Author

Hi @egonw and @johnmay ,
picking this up after some time: if you both still agree that both functionalities would be beneficial for CDK to integrate, I would start two pull requests. What I need to know for this is which branch to target in the CDK repo and into which module to integrate my code. Should I put the ErtlFunctionalGroupsFinder into the descriptor module and start a new "standardisation" module for the Sugar Removal Utility?

For your consideration, here are the (updated) repositories:

The CDK integration would in both cases consist of one principal class implementing the functionality and one corresponding test class, respectively.

@johnmay
Copy link
Member

johnmay commented Feb 7, 2024

I would try and avoid adding a new module, if anything I would like to try and simplify/collapse the number modules.

  • Possibly ErtlFunctionalGroupFinder in cdk-fragment but need to check the output. Also I would probably drop the "Ertl" in the naming, but should of course be cited in the documentation.
  • SugarRemoval possibly in cdk-extra for now, cdk-standardise isn't a terrible idea but there needs to be a lot more functionality to add that. We already have some single class modules like IONPOT and PDBCML for example and it's bad design IMO.

Quite busy these days but I will try and find time within the next week to take a close a look. At first glance the ErtlFunctionalGroupFinder looks more complicated/verbose than I imagined, there are lots of public helper functions (isCharged() for example) which should be private/package-private so they are not leaked in the JavaDoc. I need to reread/remember how it works but at first pass this looks more complicated than I thought. For comparison the RDKit one is 80 lines: https://github.com/rdkit/rdkit/blob/master/Contrib/IFG/ifg.py

@JonasSchaub
Copy link
Contributor Author

Thanks @johnmay for the initial feedback!

there are lots of public helper functions (isCharged() for example) which should be private/package-private so they are not leaked in the JavaDoc.

These are utility functions for the user to pre-check whether their input molecule is a valid argument (in a specific case, strict input restrictions can be turned on but are turned off by default). If you think these are not necessary I can remove them.

looks more complicated than I thought. For comparison the RDKit one is 80 lines: https://github.com/rdkit/rdkit/blob/master/Contrib/IFG/ifg.py

Yes, our implementation is "longer" but it is optimised in terms of speed and has more functionality than the RDKit implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants