Build openfold with newer pytorch + cuda #403

Open
chunhui-shi opened this issue Feb 7, 2024 · 12 comments

Comments

@chunhui-shi

Right now openfold requires an old pytorch + cuda (11.2), so it cannot be built on the latest Linux distributions.

I would like to upgrade the supported pytorch + cuda and the other python packages accordingly, so people can use newer platforms (OS, etc.).

@vaclavhanzl
Contributor

vaclavhanzl commented Feb 18, 2024

Indeed, preparing the environment on a newer platform is far from easy. I just did it this way on my Debian testing rolling setup:

mamba create -n of
mamba activate of
mamba install -c pytorch -c nvidia -c conda-forge -c bioconda pytorch pytorch-cuda=12.1 python=3.10 packaging ninja hhsuite kalign2 openmm pdbfixer biopython pytorch-lightning PyYAML tqdm wandb awscli aria2 hmmer deepspeed dm-tree py3Dmol modelcif
pip install flash-attention ml_collections git+https://github.com/NVIDIA/dllogger.git
git clone git@github.com:aqlaboratory/openfold.git
cd openfold
sed -i -e 's/-std=c++14/-std=c++17/' setup.py 
scripts/install_third_party_dependencies.sh
mamba deactivate
mamba activate of

When iterating towards these lines, I encountered a few pitfalls:

  • hhsuite requires python 3.10 (not 3.11)
  • pytorch 2.2.0 requires compilation with c++17, not c++14
  • you really need a properly installed CUDA (that it worked for other things is not enough)

And the preceding CUDA setup has pitfalls as well:

  • the non-free-firmware section is new in Debian; it might be missing in sources.list but is vital
  • the recent Nvidia Ampere driver does not compile with a recent Linux kernel, but Debian has a fixed driver in bookworm-updates

I find CUDA setup via Debian repos easier than via Nvidia (in fact at this moment the critical bug 4336331 in Nvidia driver is only fixed in Debian). I add these sources:

deb http://deb.debian.org/debian/ bookworm-updates main non-free contrib non-free-firmware
deb-src http://deb.debian.org/debian/ bookworm-updates main non-free contrib non-free-firmware

and install:

apt install nvidia-cuda-dev nvidia-cuda-toolkit

I got these package versions:

pytorch              2.2.0    py3.10_cuda12.1_cudnn8.9.2_0    pytorch
pytorch-cuda         12.1              ha16c6d3_5    pytorch
python               3.10.13   hd12c33a_1_cpython    conda-forge
packaging            23.2            pyhd8ed1ab_0    conda-forge
ninja                1.11.1            h924138e_0    conda-forge
hhsuite              3.3.0  py310pl5321h068649b_9    bioconda
kalign2              2.04              h031d066_5    bioconda
openmm               8.1.1        py310h358ce72_1    conda-forge
pdbfixer             1.9             pyh1a96a4e_0    conda-forge
biopython            1.83         py310h2372a71_0    conda-forge
pytorch-lightning    2.1.3           pyhd8ed1ab_0    conda-forge
pyyaml               6.0.1        py310h2372a71_1    conda-forge
tqdm                 4.66.2          pyhd8ed1ab_0    conda-forge
wandb                0.16.3          pyhd8ed1ab_0    conda-forge
awscli               2.15.21      py310hff52083_0    conda-forge
aria2                1.37.0            h347180d_1    conda-forge
hmmer                3.4               hdbdd923_0    bioconda
deepspeed            0.13.1   cpu_py310h11dbdba_0    conda-forge
dm-tree              0.1.8        py310h620c231_2    conda-forge
py3dmol              2.0.4           pyhd8ed1ab_0    conda-forge
modelcif             0.9             pyhd8ed1ab_0    conda-forge
flash-attention      1.0.0                 pypi_0    pypi
ml-collections       0.1.1                 pypi_0    pypi
dllogger             1.0.0                 pypi_0    pypi

and outside the mamba environment, I have:

$ gcc --version
gcc (Debian 10.3.0-15) 10.3.0
Copyright (C) 2020 Free Software Foundation, Inc.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

$ nvidia-smi 
NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0

$ cat /proc/version
Linux version 6.6.15-amd64 (debian-kernel@lists.debian.org) (gcc-13 (Debian 13.2.0-13) 13.2.0, GNU ld (GNU Binutils for Debian) 2.42) #1 SMP PREEMPT_DYNAMIC Debian 6.6.15-2 (2024-02-04)

It is possible that OpenFold will need some little tweaks here and there with this setup. But I hope this helps a little bit...

@vaclavhanzl
Contributor

vaclavhanzl commented Feb 19, 2024

Meanwhile, PR #407 just landed in the codebase (thanks @jnwei and everybody involved!) and it is supposed to tackle these issues. While it certainly moves the code forward (and likely contains some of the "tweaks here and there" I mentioned above), it still does not let me install an environment by just following the install instructions in the README. When I do:

mamba env create -n openfold_env -f environment.yml

it tries to install older packages than the PR #407 description suggests:

Looking for: ['python=3.9', 'libgcc=7.2', 'setuptools=59.5.0', 'pip', 'openmm=7.7', 'pdbfixer', 'cudatoolkit=11.3', 'pytorch-lightning==1.5.10', 'biopython==1.79', 'numpy==1.21', 'pandas==2.0', 'pyyaml==5.4.1', 'requests', 'scipy==1.7', 'tqdm==4.62.2', 'typing-extensions==3.10', 'wandb==0.12.21', 'modelcif==0.7', 'awscli', 'ml-collections', 'aria2', 'git', 'bioconda::hmmer==3.3.2', 'bioconda::hhsuite==3.3.0', 'bioconda::kalign2==2.04', 'pytorch::pytorch=1.12']

and then fails with:

      RuntimeError:
      The detected CUDA version (12.0) mismatches the version that was used to compile
      PyTorch (11.3). Please make sure to use the same CUDA versions.
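
The mismatch in this error can be checked before starting a build. The following is a minimal sketch, not part of OpenFold: the helper names `parse_nvcc_release` and `cuda_major_matches` are made up for illustration, and it assumes that only the CUDA major version has to agree (which is consistent with nvcc 12.0 and pytorch-cuda 12.1 working together earlier in this thread).

```python
import re
from typing import Tuple


def parse_nvcc_release(nvcc_output: str) -> Tuple[int, int]:
    """Extract (major, minor) from the 'release X.Y' part of `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def cuda_major_matches(nvcc_output: str, torch_cuda: str) -> bool:
    """Return True when the toolkit and the PyTorch build share a CUDA major version.

    Assumption (not confirmed in this thread): a minor-version difference is tolerated.
    """
    toolkit_major, _ = parse_nvcc_release(nvcc_output)
    torch_major = int(torch_cuda.split(".")[0])
    return toolkit_major == torch_major


# Line copied from the nvcc output quoted earlier in this thread.
NVCC_OUTPUT = "Cuda compilation tools, release 12.0, V12.0.140"

print(cuda_major_matches(NVCC_OUTPUT, "11.3"))  # combination from the error above
print(cuda_major_matches(NVCC_OUTPUT, "12.1"))  # combination from the working setup
```

In a live environment the second argument would typically come from `torch.version.cuda`, and the first from running `nvcc --version` via `subprocess.run`.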

In the current environment.yml, the pin pytorch-lightning==1.5.10 looks suspicious to me, since it might pull in an older Pytorch (?); in my experiments above I got pytorch-lightning=2.1.3. (Or maybe python=3.9 is the problem? python=3.10 certainly worked better for me and might influence the available Pytorch versions.)

Also, the C++14/17 patch (which I did above via sed) would likely be needed for the following compilation in scripts/install_third_party_dependencies.sh to succeed.
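
For anyone who prefers to script this patch instead of using sed, here is a rough Python equivalent. It is only a sketch: `bump_cxx_standard` and `patch_setup_py` are hypothetical helper names, and the sample string is a made-up stand-in, not the real contents of setup.py.

```python
from pathlib import Path


def bump_cxx_standard(source: str, old: str = "-std=c++14", new: str = "-std=c++17") -> str:
    """Replace every occurrence of the old C++ standard flag with the new one."""
    return source.replace(old, new)


def patch_setup_py(path: Path) -> bool:
    """Rewrite the file in place; return True only if something changed."""
    original = path.read_text()
    patched = bump_cxx_standard(original)
    if patched == original:
        return False
    path.write_text(patched)
    return True


# Made-up stand-in for a line of build flags (not the real setup.py).
sample = "cxx_flags = ['-O3', '-std=c++14']"
print(bump_cxx_standard(sample))
```

Because the replacement is a plain string substitution, running it a second time leaves an already patched file unchanged, just like the sed command.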

@vaclavhanzl
Contributor

@abeebyekeen I see that you also devoted considerable effort to setting up environment.yml. It would be nice to hear how your current setup works for you. (I am particularly interested in effects of the 'cuda' conda package - maybe it allows even more minimalist cuda setup in the operating system? Just kernel driver?) Or how it compares with my setup (2nd post in this thread) if you have any incentive to try.

@abeebyekeen

abeebyekeen commented Feb 20, 2024

Hi @vaclavhanzl. Yes, I spent a good part of last weekend trying to set up a tool that requires openfold as a dependency. I was initially unable to build openfold due to a number of problems, including the -std=c++14, gcc/g++, and CUDA version mismatch errors you have mentioned. Please note that I also had other CUDA errors thrown by pytorch. So I created a fork to try and figure out where each error was coming from (especially with building openfold).

Here is what I've got, and the selections that eventually solved all the problems for me:

  • Python version: 3.9.18
  • Linux: Red Hat 4.8.5-44, 3.10.0-1160.88.1.el7.x86_64
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
$ gcc --version
gcc (GCC) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.

For the environment I needed, here is how I set it up:

mamba create -n npl python==3.9 pip
mamba activate npl
mamba install cudatoolkit==11.8.*
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --no-cache-dir
python -m pip install torch-scatter==2.1.2 -f https://data.pyg.org/whl/torch-2.2.0+cu118.html --no-cache-dir
python -m pip install "git+https://github.com/facebookresearch/pytorch3d.git" --force-reinstall --no-deps --no-cache-dir

To get openfold to build, I was able to use the environment.yml in the openfold repo without changes. I did, however, have to use a gcc version between 5 and 11, and also set the build flags in setup.py to -std=c++17, just like @vaclavhanzl did:

git clone https://github.com/aqlaboratory/openfold.git
cd openfold
sed -i 's/std=c++14/std=c++17/g' setup.py
python -m pip install .

python -m pip install -r other_requirements.txt

And that works perfectly.
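
The gcc constraint mentioned above can be checked up front. This is a small sketch with hypothetical helper names (`gcc_major`, `gcc_in_supported_range`); it assumes the bounds 5 and 11 are inclusive, which the comment above does not spell out.

```python
import re


def gcc_major(gcc_version_output: str) -> int:
    """Return the major version from the first line of `gcc --version`."""
    lines = gcc_version_output.strip().splitlines()
    if not lines:
        raise ValueError("empty gcc output")
    # The version is the last X.Y.Z token on the first line.
    match = re.search(r"(\d+)\.\d+\.\d+\s*$", lines[0])
    if match is None:
        raise ValueError("could not find a version number in: " + lines[0])
    return int(match.group(1))


def gcc_in_supported_range(gcc_version_output: str, low: int = 5, high: int = 11) -> bool:
    """Check low <= major <= high (inclusive bounds are an assumption)."""
    return low <= gcc_major(gcc_version_output) <= high


# First lines of the two `gcc --version` outputs quoted in this thread.
print(gcc_in_supported_range("gcc (GCC) 8.3.0"))
print(gcc_in_supported_range("gcc (Debian 10.3.0-15) 10.3.0"))
```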

@vaclavhanzl
Contributor

Thanks a lot @abeebyekeen for sharing this with us, in a very clear way! I'd very much like to see OpenFold working out of the box for most new users, especially now when there are new great and publicized features. I still do not know how a PR going that way should look, addressing different setups people might have is hard. I was thinking about having ranges of versions in environment.yml but that is not an easy way either.
I guess @jnwei might have some plan here and I hope we help at least a little bit to move that way.

@jnwei
Collaborator

jnwei commented Feb 23, 2024

Hi all,

Thanks for all the interest and for sharing notes! The environments I wanted to support at this time were:

  • For use with CUDA 11.x: (main branch) pytorch 1.12 + pytorch lightning 1.5.10 (+ flash-attn dependencies)
  • For use with CUDA 12.x: (pl_upgrades) pytorch 2.1 + pytorch lightning 2.x
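
The two supported combinations above can be written down as a small lookup, for example to pick a branch in an install script. This is only an illustrative sketch of the list as stated in this comment: the dictionary layout and the function name `branch_for_cuda` are made up here and are not part of OpenFold.

```python
# Supported environments as listed in the comment above (CUDA major -> details).
SUPPORTED_ENVIRONMENTS = {
    11: {"branch": "main", "pytorch": "1.12", "pytorch_lightning": "1.5.10"},
    12: {"branch": "pl_upgrades", "pytorch": "2.1", "pytorch_lightning": "2.x"},
}


def branch_for_cuda(cuda_version: str) -> str:
    """Map a CUDA version string such as '12.0' to the branch listed for it."""
    major = int(cuda_version.split(".")[0])
    if major not in SUPPORTED_ENVIRONMENTS:
        raise ValueError("no listed environment for CUDA " + cuda_version)
    return SUPPORTED_ENVIRONMENTS[major]["branch"]


print(branch_for_cuda("11.8"))
print(branch_for_cuda("12.0"))
```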

I just checked the pl_upgrades branch on two systems I have access to with pre-installed CUDA 12, and it worked on both. Let me know if folks have issues with these environments.

My understanding was that having an environment which is CUDA 11.x + Pytorch 2.x is complicated, as the default pytorch 2 packages are built on CUDA 12 (leading to the CUDA mismatch error @vaclavhanzl saw). It looks like @abeebyekeen was able to find a workaround with a lot of elbow grease, thanks for sharing your fix!

I plan on cleaning up the documentation for this project and when I do, I'll add a page regarding the supported environments.

@vaclavhanzl
Contributor

Thanks @jnwei ! I am happy to report that the pl_upgrades branch works flawlessly with my CUDA 12, including compilations in install_third_party_dependencies.sh (the end of the 2nd post here describes my exact environment outside conda).

(And please @jnwei excuse my rather misguided comments on PR #407 - I totally overlooked that you merged to pl_upgrades, not to main.)

@wenyan4work

Quick question about future plans:
is pl_upgrades going to be merged into master?

@lm-jkominek

Hi there, just wanted to follow up on this and ask if there are any plans/timelines to merge pl_upgrades into main to get the CUDA12 support into openfold? Many thanks in advance!
@vaclavhanzl @jnwei

@jnwei
Collaborator

jnwei commented Mar 15, 2024

Hi, thanks for the interest. We're actively working on finalizing the changes in pl_upgrades and merging them into main. I'd expect ~3ish weeks.

@lm-jkominek

Thank you @jnwei , appreciate the update!

@jnwei
Collaborator

jnwei commented May 3, 2024

A quick note on the pytorch 2 / CUDA 12 upgrade:

We've run into some technical issues with the pytorch 2 upgrade. Briefly, we observe large instabilities in our training losses in the pytorch2 version relative to our pytorch 1 version.

For inference, we're also observing a slight difference between model outputs in pytorch 1 and pytorch 2. The difference in final output coordinates is about RMSD ≈ 0.05 Å for the proteins I've looked at. While these differences might seem small, they may point to a larger issue that is also occurring in training; we're currently looking into it.
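
For readers who want to run a similar comparison on their own outputs, below is a minimal RMSD sketch. It assumes both coordinate sets list the same atoms in the same order and are already in the same frame; whether the numbers above were computed with or without a prior superposition is not stated in this thread. The function name `rmsd` and the toy coordinates are illustrative only.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Coord], coords_b: Sequence[Coord]) -> float:
    """Root-mean-square deviation between two equally long coordinate lists.

    No superposition is performed; the inputs are compared as given.
    """
    if len(coords_a) != len(coords_b) or len(coords_a) == 0:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy data: every atom is shifted by 0.05 along x (units taken to be Angstrom).
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
shifted = [(0.05, 0.0, 0.0), (1.55, 0.0, 0.0)]
print(rmsd(reference, shifted))
```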

Until we find the root cause of the discrepancy, or a way around the training instability, we're not ready to update the main branch to pytorch 2.

Meanwhile, we will upgrade the main branch to use pytorch lightning 2, which has a few features that the team has found useful. I'll also push some changes to pl_upgrades that integrate some of the changes from the main branch and clean up the conda environment / docker setup for CUDA 12 / pytorch 2.

We are actively working on debugging the instability, and we'll keep you posted as soon as we are ready to upgrade. Thank you all for your interest and your patience.
