Export of raw parameters as numpy array, plus some minor fixes #28

kWeissenow · 2020-02-27T09:14:36Z

A common usage scenario for plmDCA nowadays is to use the raw Potts model parameters as input for machine learning models, especially deep learning systems, to infer contact or distance maps. The most recent and prominent example is DeepMind's AlphaFold, the winner of CASP13.
CCMpred is widely used because of its GPU acceleration, but it has the drawback of writing the raw parameters to a text file, which can be huge (>10 GB) for longer proteins. Machine learning systems almost always expect numpy arrays as inputs; these are binary representations, which are more compact and therefore faster to load.

I've implemented an option to write the raw parameters directly to a numpy array with the command line switch '-y'. This removes the additional step of parsing the text output to generate a binary representation.
For long proteins, this makes a huge difference: on a Tesla V100, an MSA with 50k sequences of a protein with 820 residues took 26m13s to process the traditional way (CCMpred -> raw text file -> parsing file to generate numpy array), whereas running CCMpred and writing a numpy array directly with my implementation took only 16m20s.
The speedups are less pronounced for smaller proteins around the average length of 200-300 residues, but they still amount to 1-2 minutes saved per sample. For my current dataset, which contains ~80k MSAs, I expect to save multiple weeks of computation time.
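
For illustration, here is a minimal sketch of how a downstream pipeline might consume such a file. It assumes the output is a standard `.npy` file with shape (L,L,441) and dtype float32; the file name, the dtype and the toy array standing in for real CCMpred output are assumptions made for this example, not something the patch guarantees.

```python
import os
import tempfile

import numpy as np

L = 8            # toy protein length; real inputs have hundreds of residues
CHANNELS = 441   # 21 x 21 coupling values per residue pair

# Stand-in for a file written by CCMpred; here we create it ourselves.
rng = np.random.default_rng(0)
params = rng.standard_normal((L, L, CHANNELS)).astype(np.float32)
path = os.path.join(tempfile.mkdtemp(), "example_raw.npy")  # hypothetical name
np.save(path, params)

# Memory-map the file so that only the requested slice is read from disk.
mapped = np.load(path, mmap_mode="r")

# Copy a 4 x 4 residue crop (all channels) into regular memory.
crop = np.array(mapped[2:6, 2:6, :])
print(crop.shape, crop.dtype)
```

With `mmap_mode="r"`, numpy does not read the full array up front, which can matter when a single file is in the gigabyte range.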

Since I assume that CCMpred is used for exactly this kind of workflow in many structure prediction research projects, I kindly invite you to integrate this addition into the main repository.

KonstantinWeissenow and others added 4 commits February 25, 2020 10:42

Added correct header for basename(), fixed CUDA error checking macro

9587005

Fixed memory calculation in README

106bf7d

Added option to output raw parameters as a numpy array with the command line option '-y'

484d204

Changed shape of numpy output from (441,L,L) to (L,L,441) for more efficient memory layout when reading crops

df6e139
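
To illustrate why the (L,L,441) shape can be friendlier for reading crops, here is a minimal numpy sketch. It uses a toy array rather than real CCMpred output, and the variable names and float32 dtype are assumptions for this example. In a C-ordered (L,L,441) array, all 441 values for one residue pair sit next to each other in memory, whereas in a (441,L,L) array they are spread across 441 separate L x L planes.

```python
import numpy as np

L = 6
# Toy data in the channels-first layout (441, L, L).
channels_first = np.arange(441 * L * L, dtype=np.float32).reshape(441, L, L)

# Move the channel axis last and store the result contiguously in memory.
channels_last = np.ascontiguousarray(np.transpose(channels_first, (1, 2, 0)))

# The same 3 x 3 residue crop taken from both layouts.
i, j, size = 1, 2, 3
crop_first = channels_first[:, i:i + size, j:j + size]
crop_last = channels_last[i:i + size, j:j + size, :]

# The 441 values of a single residue pair in each layout.
vec_first = channels_first[:, i, j]   # strided: one element per L x L plane
vec_last = channels_last[i, j]        # one contiguous run of 441 floats

print(vec_first.flags["C_CONTIGUOUS"], vec_last.flags["C_CONTIGUOUS"])
print(channels_first.strides, channels_last.strides)
```

Both crops hold the same values; only the access pattern differs. For a crop read from a memory-mapped file, the channels-last layout turns each row of the crop into one contiguous run instead of many small scattered reads.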
