Export of raw parameters as numpy array, plus some minor fixes #28

Open
wants to merge 4 commits into
base: master
kWeissenow
A common usage scenario for plmDCA nowadays is to feed the raw Potts model parameters into machine learning models, especially deep learning systems, to infer contact or distance maps. The most recent and prominent example is DeepMind's AlphaFold, the winner of CASP13.
CCMpred is widely used because of its GPU acceleration, but it has the drawback of writing the raw parameters to a text file, which can be huge (>10 GB) for longer proteins. Machine learning pipelines almost always expect numpy arrays as input; these are stored in a compact binary format and are therefore also faster to load.
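
To give a rough feeling for why the parameter files grow so quickly with protein length, here is a minimal back-of-the-envelope sketch. It assumes 21 states per position (20 amino acids plus gap), one coupling matrix per unordered residue pair, and 4-byte floats; these are assumptions for illustration, the function names are hypothetical, and the actual on-disk layout used by CCMpred may differ.

```python
def potts_param_counts(length, states=21):
    """Return (single_site, pairwise) parameter counts for a Potts model.

    Assumes `states` values per position and one states x states
    coupling matrix per unordered pair of positions (illustrative only).
    """
    single = length * states
    pairs = length * (length - 1) // 2
    pairwise = pairs * states * states
    return single, pairwise


def binary_size_bytes(length, states=21, bytes_per_value=4):
    """Estimate the size of all parameters stored as fixed-width floats."""
    single, pairwise = potts_param_counts(length, states)
    return (single + pairwise) * bytes_per_value


if __name__ == "__main__":
    for length in (250, 820):
        single, pairwise = potts_param_counts(length)
        size_mb = binary_size_bytes(length) / 1e6
        print(f"L={length}: {single} single-site, {pairwise} pairwise, "
              f"~{size_mb:.0f} MB as float32")
```

The number of pairwise parameters grows quadratically with the sequence length, and a text file spends many more bytes per value than a fixed-width binary format does.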

I've implemented an option to write the raw parameters directly to numpy arrays with the command line switch '-y'. This avoids the additional step of parsing the text output to generate a binary representation.
For long proteins, this makes a huge difference: on a Tesla V100, an MSA with 50k sequences of a protein with 820 residues took 26m13s to process the traditional way (CCMpred -> raw text file -> parsing the file to generate a numpy array), whereas running CCMpred and writing a numpy array directly with my implementation took only 16m20s.
The speedups are less pronounced for smaller proteins of average length (200-300 residues), but still amount to 1-2 minutes saved per sample. For my current dataset, which contains ~80k MSAs, I expect to save multiple weeks of computation time.
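
The difference between the two workflows can be illustrated with a small, self-contained sketch using synthetic data. It does not reproduce CCMpred's raw text format or the exact array layout written by this PR; the array shape, the file names, and the helper function are hypothetical and only serve to contrast text parsing with loading a `.npy` file.

```python
import os
import tempfile

import numpy as np


def compare_text_and_binary(shape=(50, 21, 21), seed=0):
    """Write synthetic parameters as text and as .npy, then read both back.

    Returns the original array, the array parsed from text, the array
    loaded from the binary file, and the (text, binary) file sizes in bytes.
    """
    rng = np.random.default_rng(seed)
    params = rng.standard_normal(shape).astype(np.float32)

    with tempfile.TemporaryDirectory() as tmp:
        txt_path = os.path.join(tmp, "params.txt")  # hypothetical file name
        npy_path = os.path.join(tmp, "params.npy")  # hypothetical file name

        # Text route: np.savetxt needs a 2D array, so flatten to rows.
        np.savetxt(txt_path, params.reshape(-1, shape[-1]))
        # Binary route: the array is stored as-is, including its shape.
        np.save(npy_path, params)

        # Reading text means parsing every number and restoring the shape.
        from_text = np.loadtxt(txt_path, dtype=np.float32).reshape(shape)
        # Reading .npy restores dtype and shape directly.
        from_binary = np.load(npy_path)

        sizes = (os.path.getsize(txt_path), os.path.getsize(npy_path))

    return params, from_text, from_binary, sizes


if __name__ == "__main__":
    _, _, _, (txt_size, npy_size) = compare_text_and_binary()
    print(f"text: {txt_size} bytes, npy: {npy_size} bytes")
```

Both routes yield the same values, but the text route has to parse every number and needs the shape to be restored by hand, which is the extra step the direct numpy export is meant to avoid.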

Since I assume that CCMpred is used for exactly this kind of workflow in many structure prediction research projects, I kindly invite you to integrate this addition into the main repository.
