Bump CUDA from 11.8 to 12.0 #514

weiji14 · 2024-02-16T01:30:42Z

CUDA 12.0 migration across conda-forge is practically complete (see https://conda-forge.org/status/#cuda120), so we can start updating to a newer version of CUDA!

Note:

CUDA 12.x requires a CUDA driver >=525.60.13 on Linux, see https://docs.nvidia.com/deploy/cuda-compatibility/index.html#cuda-intro and https://docs.nvidia.com/cuda/archive/12.0.0/cuda-toolkit-release-notes/index.html#cuda-toolkit-major-component-versions. CUDA 11.x previously required a CUDA driver >=450.80.02.
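
For anyone wanting to check a host before pulling the new image, here is a minimal sketch of the driver comparison described in the note. The names `MIN_DRIVER`, `parse_version` and `driver_supports` are made up for illustration and are not part of this repository; the two minimum versions are the ones quoted above.

```python
# Minimum driver versions quoted in the note above (hypothetical lookup table).
MIN_DRIVER = {
    "12.x": (525, 60, 13),
    "11.x": (450, 80, 2),
}


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '525.60.13' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_supports(driver: str, cuda_series: str) -> bool:
    """Return True if the driver version meets the minimum for the CUDA series."""
    return parse_version(driver) >= MIN_DRIVER[cuda_series]


# A 470-series driver is new enough for CUDA 11.x but not for CUDA 12.x.
print(driver_supports("470.57.02", "11.x"))  # True
print(driver_supports("470.57.02", "12.x"))  # False
```

The driver string itself would have to come from the host, for example from the output of nvidia-smi; that step is left out of the sketch.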

Changes in this PR:

Update PyTorch, Torchvision and TensorFlow to use CUDA 12.0 builds
Update minimum pin on TensorFlow from >=2.14.0 to >=2.15.0
Update minimum pin on PyTorch from 2.0.0 to 2.1.0, and Torchvision from 0.15.1 to 0.16.1
Pin minimum version of flax to 0.8.0, which helps to resolve "Flax needs to be upgraded in the tensorflow/jax image" (#489)

References:

Supersedes #505, Fixes #489

github-actions · 2024-02-16T01:30:51Z

👈 Try on Mybinder.org!

pangeo-bot · 2024-02-16T01:30:53Z

/condalock
Automatically locking new conda environment, building, and testing images...

weiji14 · 2024-02-16T03:24:44Z

/condalock

Xref https://conda-forge.org/docs/maintainer/knowledge_base.html#cuda-builds

weiji14 · 2024-02-16T03:31:31Z

/condalock

weiji14 · 2024-02-16T03:42:54Z

Hmm, conda-lock is not handling the __cuda constraint somehow, even though I set the CONDA_OVERRIDE_CUDA environment variable already at 4d26a19. Traceback from https://github.com/pangeo-data/pangeo-docker-images/actions/runs/7925654432/job/21639166946#step:4:23:

The following package could not be installed
└─ tensorflow >=2.15.0 *cuda120* is not installable because it requires
   ├─ __cuda, which can be installed;
   └─ tensorflow-estimator [2.15.0 cuda120py310h549c77d_2|2.15.0 cuda120py310h549c77d_3|...|2.15.0 cuda120py39ha585809_3], which requires
      └─ cuda-version >=12.0,<13 , which requires
         └─ __cuda >=12 , which conflicts with any installable versions previously reported.

Need to see what's going on.
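
For context, a minimal sketch of the approach tried in 4d26a19: setting CONDA_OVERRIDE_CUDA in the environment of the conda-lock call so that the solver is told a `__cuda` virtual package is available. The helper name `conda_lock_invocation` is hypothetical, and the sketch only builds the command and environment without running anything. As the traceback above shows, this setting alone was not enough to make the lock succeed here.

```python
import os


def conda_lock_invocation(cuda_version, env_files, platform="linux-64"):
    """Build (command, environment) for a conda-lock run; nothing is executed."""
    command = ["conda-lock", "lock"]
    for path in env_files:
        command += ["-f", path]
    command += ["-p", platform]

    # Copy the current environment and advertise the desired __cuda version.
    env = dict(os.environ)
    env["CONDA_OVERRIDE_CUDA"] = cuda_version
    return command, env


command, env = conda_lock_invocation("12.0", ["environment.yml"])
print(" ".join(command))
# To actually run it: subprocess.run(command, env=env, check=True)
```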

weiji14 · 2024-05-17T19:17:18Z

Update: Finally, after setting a cuda-version=12.0 pin (instead of selecting CUDA through the build string), the ml-notebook locking worked at https://github.com/pangeo-data/pangeo-docker-images/actions/runs/9132472927/job/25113889544#step:4:20 🎉. The locking on pytorch-notebook timed out after 20min though, so I'm increasing the timeout to 30min and trying again 🤞

weiji14 · 2024-05-17T20:19:24Z

Ah ok, can't just change the timeout in this PR, might need to have it on the master branch first.

weiji14 · 2024-05-21T02:39:49Z

Oh wow, now the locking takes just 2min 30s 😅 Anyways, this is finally ready for review!

Edit: Not so fast, the lockfiles show cpu versions getting pulled in instead of cuda versions, see #514 (comment)

weiji14 · 2024-05-21T03:56:45Z

pytorch-notebook/conda-lock.yml

    networkx: ''
    numpy: '>=1.23.5,<2.0a0'
    python: '>=3.11,<3.12.0a0'
    python_abi: 3.11.*
    sleef: '>=3.5.1,<4.0a0'
    sympy: ''
    typing_extensions: ''
-  url: https://conda.anaconda.org/conda-forge/linux-64/pytorch-2.3.0-cuda118_py311h6c9cb27_300.conda
+  url: https://conda.anaconda.org/conda-forge/linux-64/pytorch-2.3.0-cpu_mkl_py311h9835ca6_100.conda


Hmm, the cpu build is getting pulled in, so we might still need to keep the cuda* pin on the build string...

Update: pytorch-2.3.0-cuda120* is pulled in now, see #514 (comment)
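
A quick way to spot this kind of regression in a lockfile diff is to look at the build string in the package URL. Below is a minimal sketch, assuming the usual `<name>-<version>-<build>.<ext>` layout of conda package filenames; `build_string` and `matches_build` are hypothetical helpers, and the glob match only approximates how a `cuda120*` build pin behaves in the solver.

```python
from fnmatch import fnmatchcase


def build_string(url: str) -> str:
    """Extract the build string from a conda package URL."""
    filename = url.rsplit("/", 1)[-1]
    for ext in (".conda", ".tar.bz2"):
        if filename.endswith(ext):
            filename = filename[: -len(ext)]
            break
    # Package names may contain hyphens, so split from the right.
    _name, _version, build = filename.rsplit("-", 2)
    return build


def matches_build(url: str, pattern: str) -> bool:
    """Check the build string against a glob pattern such as 'cuda120*'."""
    return fnmatchcase(build_string(url), pattern)


base = "https://conda.anaconda.org/conda-forge/linux-64/"
old = base + "pytorch-2.3.0-cuda118_py311h6c9cb27_300.conda"
new = base + "pytorch-2.3.0-cpu_mkl_py311h9835ca6_100.conda"

print(build_string(new))            # cpu_mkl_py311h9835ca6_100
print(matches_build(old, "cuda*"))  # True
print(matches_build(new, "cuda*"))  # False
```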

weiji14 · 2024-05-21T05:36:23Z

Found a clue after creating a virtual-packages.yml file with the following contents:

subdirs:
  linux-64:
    packages:
      __cuda: "12.0"

Running conda-lock lock -f environment.yml -f ../pangeo-notebook/environment.yml -f ../base-notebook/environment.yml -p linux-64 gave this error:

INFO:conda_lock.conda_lock:Using virtual packages from virtual-packages.yml
Locking dependencies for ['linux-64']...
INFO:conda_lock.conda_solver:linux-64 using specs ['cuda-version 12.0.*', 'jupyterlab-nvdashboard', 'gpytorch', 'pytorch >=2.3.0 cuda120*', 'torchvision >=0.18.0 cuda120*', 'torchgeo', 'adlfs', 'argopy', 'awscli', 'black', 'boto3', 'bottleneck', 'cartopy', 'cdsapi', 'cfgrib', 'cf_xarray', 'ciso', 'cmocean', 'dask-ml', 'datashader', 'descartes', 'earthaccess', 'eofs', 'erddapy', 'esmpy', 'fastjmd95', 'flox', 'fsspec', 'gcm_filters', 'gcsfs', 'gh', 'gh-scoped-creds', 'geocube', 'geopandas', 'geopy', 'geoviews-core', 'git-lfs', 'gsw', 'h5netcdf', 'h5py', 'holoviews', 'hvplot', 'intake', 'intake-esm', 'intake-geopandas', 'intake-stac', 'intake-xarray', 'ipdb', 'ipykernel', 'ipyleaflet', 'ipytree', 'ipywidgets', 'jupyterlab_code_formatter', 'jupyterlab-git', 'jupyterlab-lsp', 'jupyterlab-myst', 'jupyter-panel-proxy', 'jupyter-resource-usage', 'kerchunk', 'line_profiler', 'lxml', 'lz4', 'matplotlib-base', 'memory_profiler', 'metpy', 'nb_conda_kernels', 'nbstripout', 'nc-time-axis', 'netcdf4', 'numbagg', 'numcodecs', 'numpy', 'numpy_groupies', 'odc-stac', 'pandas', 'panel', 'parcels', 'param', 'pop-tools', 'pyarrow', 'pycamhd', 'pydap', 'pystac', 'pystac-client', 'python-blosc', 'python-gist', 'python-graphviz', 'python-lsp-ruff', 'python-xxhash', 'rasterio', 'rechunker', 'rio-cogeo', 'rioxarray', 'ruff', 's3fs', 'satpy', 'scikit-image', 'scikit-learn', 'scipy', 'seaborn', 'sparse', 'snakeviz', 'stackstac', 'tiledb-py', 'timezonefinder', 'watermark', 'xarray', 'xarrayutils', 'xarray-datatree', 'xarray_leaflet', 'xarray-spatial', 'xbatcher', 'xcape', 'xclim', 'xesmf', 'xgboost', 'xgcm', 'xhistogram', 'xmip', 'xmitgcm', 'xpublish', 'xrft', 'xskillscore', 'xxhash', 'zarr', 'python 3.11.*', 'pangeo-notebook 2024.05.20.*', 'pip']
Failed to parse json, Expecting value: line 1 column 1 (char 0)
Could not lock the environment for platform linux-64
Could not solve for environment specs
The following packages are incompatible
├─ cuda-version 12.0**  is requested and can be installed;
└─ torchvision >=0.18.0 cuda120* is not installable because it requires
   └─ libcusparse >=12.3.1.170,<13.0a0 , which requires
      └─ cuda-version >=12.4,<12.5.0a0 , which conflicts with any installable versions previously reported.
{
    "success": false
}

Need to play around with relaxing the cuda-version pinning a bit.

Xref https://github.com/conda/conda-lock/tree/v2.5.7?tab=readme-ov-file#--virtual-package-spec
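
To make the conflict in the solver output concrete, here is a minimal sketch of why an exact `cuda-version 12.0.*` pin cannot be satisfied together with `cuda-version >=12.4,<12.5.0a0`, while a `>=12.0` minimum pin can. All function names are hypothetical, the `0a0` pre-release suffix is simplified to a plain upper bound of 12.5, and the candidate list is illustrative only, not a statement of which cuda-version releases exist on conda-forge.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn '12.4' into (12, 4) for tuple comparison."""
    return tuple(int(part) for part in version.split("."))


def satisfies_libcusparse(version: str) -> bool:
    """cuda-version >=12.4,<12.5.0a0 (upper bound simplified to 12.5)."""
    return (12, 4) <= parse(version) < (12, 5)


def exact_pin(version: str) -> bool:
    """cuda-version 12.0.*"""
    return parse(version)[:2] == (12, 0)


def minimum_pin(version: str) -> bool:
    """cuda-version >=12.0"""
    return parse(version) >= (12, 0)


# Illustrative candidates only (an assumption for this sketch).
CANDIDATES = ["11.8", "12.0", "12.2", "12.4"]


def solutions(pin) -> list[str]:
    """Candidates that satisfy both the chosen pin and the libcusparse bound."""
    return [v for v in CANDIDATES if pin(v) and satisfies_libcusparse(v)]


print(solutions(exact_pin))    # []
print(solutions(minimum_pin))  # ['12.4']
```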

weiji14 · 2024-05-21T05:38:19Z

/condalock

weiji14 · 2024-05-21T05:42:21Z

pytorch-notebook/conda-lock.yml

@@ -9234,10 +9394,10 @@ package:
    sleef: '>=3.5.1,<4.0a0'
    sympy: ''
    typing_extensions: ''
-  url: https://conda.anaconda.org/conda-forge/linux-64/pytorch-2.3.0-cuda118_py311h6c9cb27_300.conda
+  url: https://conda.anaconda.org/conda-forge/linux-64/pytorch-2.3.0-cuda120_py311h2667f23_300.conda


Yes, finally getting the cuda120 build version of pytorch!!!

scottyhq · 2024-05-21T15:30:03Z

Thanks for sticking with this @weiji14! Seems like adding virtual-packages.yml was key, although maybe it's just a matter of new releases with more compatible dependencies. In any case, this also reduces the ML image size by ~700MB!

tensorflow: 11.3 -> 10.7 GB
pytorch: 13.7 -> 13 GB

I'm going to go ahead and merge.

weiji14 · 2024-05-21T20:45:51Z

Yeah, I want to say that virtual-packages.yml was the thing that got it to work, but I'm still confused as to why conda-lock's --with-cuda flag didn't work. It's hard to create a minimal reproducible example to report upstream, and I'm not sure which packages are nudging the solver towards an 'incorrect' solution with cpu packages, so I'll just leave it at that for now.
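
For reference, a minimal sketch of the virtual-packages.yml approach: write the spec file with the contents used in this PR and build a conda-lock command that points at it. The helper names are hypothetical and nothing is executed. The `--virtual-package-spec` flag is the one described in the conda-lock README linked in this thread; per the log line "Using virtual packages from virtual-packages.yml", conda-lock also appears to pick the file up on its own when it sits in the working directory.

```python
import tempfile
from pathlib import Path

# Same contents as the virtual-packages.yml added in this PR.
VIRTUAL_PACKAGES = """\
subdirs:
  linux-64:
    packages:
      __cuda: "12.0"
"""


def write_virtual_packages(directory: Path) -> Path:
    """Write virtual-packages.yml into the given directory and return its path."""
    path = directory / "virtual-packages.yml"
    path.write_text(VIRTUAL_PACKAGES)
    return path


def lock_command(spec: Path, env_file="environment.yml", platform="linux-64"):
    """Build a conda-lock command that passes the spec file explicitly."""
    return [
        "conda-lock", "lock",
        "--virtual-package-spec", str(spec),
        "-f", env_file,
        "-p", platform,
    ]


with tempfile.TemporaryDirectory() as tmp:
    spec = write_virtual_packages(Path(tmp))
    print(" ".join(lock_command(spec)))
```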

> In any case, this also reduces the ML image size by ~700MB!

Yes, this is because conda-forge has split the cudatoolkit package into separate sub-components for CUDA 12 (see conda-forge/conda-forge.github.io#1963), such as libcublas-dev and libcusparse-dev, so the images should be much more lightweight now.

weiji14 added 2 commits February 6, 2024 12:51

Bump CUDA from 11.8 to 12.0

b6f62f6

Update Pytorch, Torchvision and Tensorflow to use CUDA 12.0 builds. Also bumped pytorch from 2.0.0 to 2.1.0, torchvision from 0.15.1 to 0.16.1 and tensorflow from 2.14.0 to 2.15.0 because lower versions only have CUDA 11.8 builds on conda-forge.

Pin to flax>=0.8.0

0417e79

weiji14 self-assigned this Feb 16, 2024

Set with-cuda=12.0 flag on conda-lock lock

0a52466

Set CONDA_OVERRIDE_CUDA environment variable

4d26a19

Xref https://conda-forge.org/docs/maintainer/knowledge_base.html#cuda-builds

Regenerate conda-lock files for ml-notebook and pytorch-notebook

f66f91c

Manually re-locking

Delete conda-lock.yml files for ml-notebook and pytorch-notebook

7658847

weiji14 force-pushed the cuda-12.0 branch from f28f6e6 to 7658847 Compare February 16, 2024 04:10

weiji14 added 2 commits February 16, 2024 17:16

Revert "Delete conda-lock.yml files for ml-notebook and pytorch-notebook"

6c7a11d

This reverts commit 7658847.

Bump conda-lock from 2.3 to 2.5

47c4abf

Bumps [conda-lock](https://github.com/conda/conda-lock) from 2.3.0 to 2.5.5. - [Release notes](https://github.com/conda/conda-lock/releases) - [Commits](conda/conda-lock@v2.3.0...v2.5.5)

Bump conda from 23.11.0 to 24.1.2

c016117

Bumps [conda](https://github.com/conda/conda) from 23.11.0 to 24.1.2. - [Release notes](https://github.com/conda/conda/releases) - [Changelog](https://github.com/conda/conda/blob/master/CHANGELOG.md) - [Commits](conda/conda@23.11.0...24.1.2)

Merge branch 'master' into cuda-12.0

eb154ce

Try manually removing cudatoolkit from conda-lock.yml

da73986

Xref mamba-org/mamba#3120

Merge branch 'master' into cuda-12.0

88fd9ce

Revert "Set CONDA_OVERRIDE_CUDA environment variable"

a842777

This reverts commit 4d26a19.

weiji14 mentioned this pull request May 17, 2024

Bump conda-lock to 2.5, and increase locking timeout to 45min #546

Merged

Merge branch 'master' into cuda-12.0

11f3c93

weiji14 mentioned this pull request May 21, 2024

Increase conda-lock timeout to 3 hours #547

Merged

Merge branch 'master' into cuda-12.0

d9e8ef9

[condalock-command] autogenerated conda-lock files

12da355

weiji14 marked this pull request as ready for review May 21, 2024 02:40

Revert "[condalock-command] autogenerated conda-lock files"

e337086

This reverts commit 12da355.

Keep cuda120 build string regex pattern

5d7e2e4

Partial revert of 537c27f

weiji14 force-pushed the cuda-12.0 branch from a912e4a to 5d7e2e4 Compare May 21, 2024 04:24

Set cuda-version pin as >=12.0 instead of =12.0

2469167

Pin minimum version of CUDA to 12.0, instead of specifying exact pin on 12.0

Add virtual-packages.yml to ml-notebook and pytorch-notebook

eb60068

Xref https://github.com/conda/conda-lock/tree/v2.5.7?tab=readme-ov-file#--virtual-package-spec

[condalock-command] autogenerated conda-lock files

b9b27ae

scottyhq approved these changes May 21, 2024

scottyhq merged commit 6d4c2ab into master May 21, 2024
5 checks passed

scottyhq deleted the cuda-12.0 branch May 21, 2024 15:30

weiji14 mentioned this pull request May 21, 2024

Pin jaxlib to use cuda120 build with cuda-nvcc dependency #549

Draft
