Export of raw parameters as numpy array, plus some minor fixes #28
A common use of plmDCA today is to feed the raw Potts model parameters into machine learning systems, especially deep learning models, that infer contact or distance maps. The most recent and prominent example is DeepMind's AlphaFold, the winner of CASP13.
CCMpred is widely used because of its GPU acceleration, but it has the drawback of writing the raw parameters as a text file, which can be huge (>10 GB) for longer proteins. Machine learning pipelines almost always expect numpy arrays as input; these are stored in a binary format that is more compact than text and therefore faster to load.
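To illustrate the size difference between the two representations, here is a minimal sketch that writes the same float32 array once as text and once as a binary `.npy` file. The array is random data standing in for Potts parameters, and the file names and the function `compare_text_and_binary` are hypothetical; none of this is CCMpred's actual output format.

```python
import os
import tempfile
import time

import numpy as np


def compare_text_and_binary(n_values=200_000, seed=0):
    """Write one array as text and as .npy, then compare size and load time."""
    rng = np.random.default_rng(seed)
    # Random stand-in for raw model parameters (assumption, not real data).
    params = rng.standard_normal(n_values).astype(np.float32)

    with tempfile.TemporaryDirectory() as tmp:
        txt_path = os.path.join(tmp, "params.txt")  # hypothetical file name
        npy_path = os.path.join(tmp, "params.npy")  # hypothetical file name

        np.savetxt(txt_path, params)  # one decimal number per line
        np.save(npy_path, params)     # short header plus 4 bytes per value

        start = time.perf_counter()
        from_txt = np.loadtxt(txt_path, dtype=np.float32)
        txt_seconds = time.perf_counter() - start

        start = time.perf_counter()
        from_npy = np.load(npy_path)
        npy_seconds = time.perf_counter() - start

        return {
            "txt_bytes": os.path.getsize(txt_path),
            "npy_bytes": os.path.getsize(npy_path),
            "txt_seconds": txt_seconds,
            "npy_seconds": npy_seconds,
            "text_matches": bool(np.allclose(from_txt, params)),
            "binary_exact": bool(np.array_equal(from_npy, params)),
        }


if __name__ == "__main__":
    for key, value in compare_text_and_binary().items():
        print(f"{key}: {value}")
```

The absolute timings depend on the machine and disk, so only the relative sizes should be read as meaningful here.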
I've implemented an option to write the raw parameters directly to numpy arrays, enabled with the command line switch '-y'. This removes the additional step of parsing the text output to generate a binary representation.
For long proteins, this makes a huge difference: on a Tesla V100, an MSA with 50k sequences of a protein with 820 residues took 26m13s to process in the traditional way (CCMpred -> raw text file -> parsing the file to generate a numpy array), whereas running CCMpred and writing a numpy array directly with my implementation took only 16m20s.
The speedups are less pronounced for smaller proteins of average length (200-300 residues), but they still amount to 1-2 minutes saved per sample. For my current dataset, which contains ~80k MSAs, I expect to save multiple weeks of computation time.
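For reference, the arithmetic behind these figures can be sketched as follows. It uses only the numbers quoted above and assumes, purely for illustration, that all ~80k MSAs are processed one after another on a single device; the function names are hypothetical.

```python
def to_seconds(minutes, seconds):
    """Convert a 'XmYs' timing into seconds."""
    return 60 * minutes + seconds


def summarize_benchmark():
    """Speedup for the 820-residue example quoted above."""
    text_route = to_seconds(26, 13)    # CCMpred -> text file -> parse: 1573 s
    direct_route = to_seconds(16, 20)  # CCMpred -> numpy array: 980 s
    return {
        "saved_seconds": text_route - direct_route,
        "speedup": text_route / direct_route,
    }


def estimate_dataset_savings(n_msas=80_000, minutes_saved=(1, 2)):
    """Total time saved in days, assuming strictly serial processing."""
    minutes_per_day = 60 * 24
    low, high = minutes_saved
    return (n_msas * low / minutes_per_day, n_msas * high / minutes_per_day)


if __name__ == "__main__":
    bench = summarize_benchmark()
    print(f"saved: {bench['saved_seconds']} s, speedup: {bench['speedup']:.2f}x")
    low_days, high_days = estimate_dataset_savings()
    print(f"dataset savings: {low_days:.1f} to {high_days:.1f} days")
```

Under the serial assumption, the 1-2 minutes per sample add up to roughly 56 to 111 days, which is consistent with the "multiple weeks" estimate.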
Since I assume that CCMpred is used for exactly this kind of workflow in many structure prediction research projects, I kindly invite you to integrate this addition into the main repository.