Bump CUDA from 11.8 to 12.0 #514

Merged
merged 25 commits into from
May 21, 2024

Conversation

weiji14
Member

@weiji14 weiji14 commented Feb 16, 2024

CUDA 12.0 migration across conda-forge is practically complete (see https://conda-forge.org/status/#cuda120), so we can start updating to a newer version of CUDA!

Note:

Changes in this PR:

  • Update Pytorch, Torchvision and Tensorflow to use CUDA 12.0 builds
  • Update minimum pin on Tensorflow from >=2.14.0 to >=2.15.0
  • Update minimum pin on Pytorch from 2.0.0 to 2.1.0, and torchvision from 0.15.1 to 0.16.1
  • Pin minimum version of flax to 0.8.0, which helps to resolve "Flax needs to be upgraded in the tensorflow/jax image" (#489)

References:

Supersedes #505, Fixes #489

Update Pytorch, Torchvision and Tensorflow to use CUDA 12.0 builds. Also bumped pytorch from 2.0.0 to 2.1.0, torchvision from 0.15.1 to 0.16.1 and tensorflow from 2.14.0 to 2.15.0 because lower versions only have CUDA 11.8 builds on conda-forge.
@weiji14 weiji14 self-assigned this Feb 16, 2024

@pangeo-bot
Collaborator

/condalock
Automatically locking new conda environment, building, and testing images...

@weiji14
Member Author

weiji14 commented Feb 16, 2024

/condalock

@weiji14
Member Author

weiji14 commented Feb 16, 2024

/condalock

@weiji14
Member Author

weiji14 commented Feb 16, 2024

Hmm, conda-lock is not handling the __cuda constraint somehow, even though I set the CONDA_OVERRIDE_CUDA environment variable already at 4d26a19. Traceback from https://github.com/pangeo-data/pangeo-docker-images/actions/runs/7925654432/job/21639166946#step:4:23:

The following package could not be installed
└─ tensorflow >=2.15.0 *cuda120* is not installable because it requires
   ├─ __cuda, which can be installed;
   └─ tensorflow-estimator [2.15.0 cuda120py310h549c77d_2|2.15.0 cuda120py310h549c77d_3|...|2.15.0 cuda120py39ha585809_3], which requires
      └─ cuda-version >=12.0,<13 , which requires
         └─ __cuda >=12 , which conflicts with any installable versions previously reported.

Need to see what's going on.
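For reference, a minimal sketch of how the override can be passed when driving conda-lock from a script. `CONDA_OVERRIDE_CUDA` and the `-f`/`-p` flags are the ones that appear in this thread; the helper names and the single `environment.yml` path are hypothetical, and, as the traceback above shows, this setting alone did not make the solve succeed here.

```python
import os
import shutil
import subprocess


def build_lock_env(cuda_version="12.0", base_env=None):
    """Return a copy of the environment with the __cuda virtual package overridden."""
    env = dict(os.environ if base_env is None else base_env)
    # conda reads CONDA_OVERRIDE_CUDA to report a __cuda virtual package
    # even on machines without a GPU driver (e.g. CI runners).
    env["CONDA_OVERRIDE_CUDA"] = cuda_version
    return env


def build_lock_command(env_files, platform="linux-64"):
    """Assemble a conda-lock invocation with one -f per environment file."""
    command = ["conda-lock", "lock"]
    for path in env_files:
        command += ["-f", path]
    command += ["-p", platform]
    return command


def run_lock(env_files, cuda_version="12.0"):
    """Run conda-lock if it is installed; return None when it is not on PATH."""
    if shutil.which("conda-lock") is None:
        return None
    return subprocess.run(
        build_lock_command(env_files),
        env=build_lock_env(cuda_version),
        check=False,  # inspect returncode instead of raising on solver failure
    )
```

Calling `run_lock(["environment.yml"])` would then attempt the lock with `__cuda` reported as 12.0.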

@weiji14
Member Author

weiji14 commented May 17, 2024

Update: Finally, after setting a cuda-version=12.0 pin (instead of setting it in the build string), the ml-notebook locking worked at https://github.com/pangeo-data/pangeo-docker-images/actions/runs/9132472927/job/25113889544#step:4:20 🎉. The locking on pytorch-notebook timed out after 20 min though, so I'm increasing that to 30 min and trying again 🤞

@weiji14
Member Author

weiji14 commented May 17, 2024

Ah ok, can't just change the timeout in this PR, might need to have it on the master branch first.

@weiji14
Member Author

weiji14 commented May 21, 2024

Oh wow, now the locking takes just 2min 30s 😅 Anyways, this is finally ready for review!

Edit: Not so fast, the lockfiles show cpu versions getting pulled in instead of cuda versions, see #514 (comment)

@weiji14 weiji14 marked this pull request as ready for review May 21, 2024 02:40
networkx: ''
numpy: '>=1.23.5,<2.0a0'
python: '>=3.11,<3.12.0a0'
python_abi: 3.11.*
sleef: '>=3.5.1,<4.0a0'
sympy: ''
typing_extensions: ''
- url: https://conda.anaconda.org/conda-forge/linux-64/pytorch-2.3.0-cuda118_py311h6c9cb27_300.conda
+ url: https://conda.anaconda.org/conda-forge/linux-64/pytorch-2.3.0-cpu_mkl_py311h9835ca6_100.conda
Member Author

Hmm, the cpu version is getting pulled in, might need to actually do the cuda* pin on the build string still...
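A quick way to catch this kind of regression is to scan the package URLs in the lockfile for the build variant. The sketch below assumes the usual conda `name-version-build` filename layout and uses the URLs quoted in this PR; the helper names are hypothetical and it is not part of the repository's tooling.

```python
import re


def split_package_url(url):
    """Split a conda package URL into (name, version, build string)."""
    filename = url.rsplit("/", 1)[-1]
    for extension in (".conda", ".tar.bz2"):
        if filename.endswith(extension):
            filename = filename[: -len(extension)]
            break
    # The last two hyphen-separated fields are version and build string;
    # everything before them is the package name (which may contain hyphens).
    name, version, build = filename.rsplit("-", 2)
    return name, version, build


def build_variant(build):
    """Classify a build string as 'cuda<NNN>', 'cpu', or 'unknown'."""
    match = re.match(r"cuda(\d+)", build)
    if match:
        return "cuda" + match.group(1)
    if build.startswith("cpu"):
        return "cpu"
    return "unknown"


def unexpected_builds(urls, packages, expected_variant):
    """Return the URLs of the named packages whose variant is not the expected one."""
    problems = []
    for url in urls:
        name, _version, build = split_package_url(url)
        if name in packages and build_variant(build) != expected_variant:
            problems.append(url)
    return problems


LOCKED_URLS = [
    "https://conda.anaconda.org/conda-forge/linux-64/pytorch-2.3.0-cuda118_py311h6c9cb27_300.conda",
    "https://conda.anaconda.org/conda-forge/linux-64/pytorch-2.3.0-cpu_mkl_py311h9835ca6_100.conda",
    "https://conda.anaconda.org/conda-forge/linux-64/pytorch-2.3.0-cuda120_py311h2667f23_300.conda",
]

for bad_url in unexpected_builds(LOCKED_URLS, {"pytorch"}, "cuda120"):
    print("not a cuda120 build:", bad_url)
```

With the three URLs above, the cuda118 and cpu_mkl builds are flagged and the cuda120 build passes.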

Member Author

Update: pytorch-2.3.0-cuda120* is pulled in now, see #514 (comment)

Pin minimum version of CUDA to 12.0, instead of specifying exact pin on 12.0

@weiji14
Member Author

weiji14 commented May 21, 2024

Found a clue after creating a virtual-packages.yml file with the following contents:

subdirs:
  linux-64:
    packages:
      __cuda: "12.0"

Running conda-lock lock -f environment.yml -f ../pangeo-notebook/environment.yml -f ../base-notebook/environment.yml -p linux-64 gave this error:

INFO:conda_lock.conda_lock:Using virtual packages from virtual-packages.yml
Locking dependencies for ['linux-64']...
INFO:conda_lock.conda_solver:linux-64 using specs ['cuda-version 12.0.*', 'jupyterlab-nvdashboard', 'gpytorch', 'pytorch >=2.3.0 cuda120*', 'torchvision >=0.18.0 cuda120*', 'torchgeo', 'adlfs', 'argopy', 'awscli', 'black', 'boto3', 'bottleneck', 'cartopy', 'cdsapi', 'cfgrib', 'cf_xarray', 'ciso', 'cmocean', 'dask-ml', 'datashader', 'descartes', 'earthaccess', 'eofs', 'erddapy', 'esmpy', 'fastjmd95', 'flox', 'fsspec', 'gcm_filters', 'gcsfs', 'gh', 'gh-scoped-creds', 'geocube', 'geopandas', 'geopy', 'geoviews-core', 'git-lfs', 'gsw', 'h5netcdf', 'h5py', 'holoviews', 'hvplot', 'intake', 'intake-esm', 'intake-geopandas', 'intake-stac', 'intake-xarray', 'ipdb', 'ipykernel', 'ipyleaflet', 'ipytree', 'ipywidgets', 'jupyterlab_code_formatter', 'jupyterlab-git', 'jupyterlab-lsp', 'jupyterlab-myst', 'jupyter-panel-proxy', 'jupyter-resource-usage', 'kerchunk', 'line_profiler', 'lxml', 'lz4', 'matplotlib-base', 'memory_profiler', 'metpy', 'nb_conda_kernels', 'nbstripout', 'nc-time-axis', 'netcdf4', 'numbagg', 'numcodecs', 'numpy', 'numpy_groupies', 'odc-stac', 'pandas', 'panel', 'parcels', 'param', 'pop-tools', 'pyarrow', 'pycamhd', 'pydap', 'pystac', 'pystac-client', 'python-blosc', 'python-gist', 'python-graphviz', 'python-lsp-ruff', 'python-xxhash', 'rasterio', 'rechunker', 'rio-cogeo', 'rioxarray', 'ruff', 's3fs', 'satpy', 'scikit-image', 'scikit-learn', 'scipy', 'seaborn', 'sparse', 'snakeviz', 'stackstac', 'tiledb-py', 'timezonefinder', 'watermark', 'xarray', 'xarrayutils', 'xarray-datatree', 'xarray_leaflet', 'xarray-spatial', 'xbatcher', 'xcape', 'xclim', 'xesmf', 'xgboost', 'xgcm', 'xhistogram', 'xmip', 'xmitgcm', 'xpublish', 'xrft', 'xskillscore', 'xxhash', 'zarr', 'python 3.11.*', 'pangeo-notebook 2024.05.20.*', 'pip']
Failed to parse json, Expecting value: line 1 column 1 (char 0)
Could not lock the environment for platform linux-64
Could not solve for environment specs
The following packages are incompatible
├─ cuda-version 12.0**  is requested and can be installed;
└─ torchvision >=0.18.0 cuda120* is not installable because it requires
   └─ libcusparse >=12.3.1.170,<13.0a0 , which requires
      └─ cuda-version >=12.4,<12.5.0a0 , which conflicts with any installable versions previously reported.
{
    "success": false
}

Need to play around with relaxing the cuda-version pinning a bit.
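The conflict in the solver output can be reproduced with plain range arithmetic. This is only a sketch: the candidate list of cuda-version releases is hypothetical, and the `<12.5.0a0` upper bound is simplified to `<12.5`.

```python
def as_tuple(version):
    """Turn '12.4' into (12, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def in_range(version, lower, upper):
    """True if lower <= version < upper."""
    return as_tuple(lower) <= as_tuple(version) < as_tuple(upper)


# Hypothetical set of cuda-version releases the solver could choose from.
CANDIDATES = ["12.0", "12.1", "12.2", "12.3", "12.4", "12.5"]


def libcusparse_ok(version):
    # libcusparse >=12.3.1.170 requires cuda-version >=12.4,<12.5.0a0
    return in_range(version, "12.4", "12.5")


def exact_pin_ok(version):
    # 'cuda-version 12.0.*' only admits 12.0.x
    return in_range(version, "12.0", "12.1")


def minimum_pin_ok(version):
    # 'cuda-version >=12.0' admits anything from 12.0 upwards
    return as_tuple(version) >= as_tuple("12.0")


exact = [v for v in CANDIDATES if exact_pin_ok(v) and libcusparse_ok(v)]
minimum = [v for v in CANDIDATES if minimum_pin_ok(v) and libcusparse_ok(v)]

print("exact pin 12.0.*:", exact)
print("minimum pin >=12.0:", minimum)
```

The exact pin leaves no version that also satisfies libcusparse, while a minimum pin still admits 12.4, which is why relaxing the pin is worth trying.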

@weiji14
Member Author

weiji14 commented May 21, 2024

/condalock

@@ -9234,10 +9394,10 @@ package:
sleef: '>=3.5.1,<4.0a0'
sympy: ''
typing_extensions: ''
- url: https://conda.anaconda.org/conda-forge/linux-64/pytorch-2.3.0-cuda118_py311h6c9cb27_300.conda
+ url: https://conda.anaconda.org/conda-forge/linux-64/pytorch-2.3.0-cuda120_py311h2667f23_300.conda
Member Author

Yes, finally getting the cuda120 build version of pytorch!!!

@scottyhq
Member

Thanks for sticking with this @weiji14 ! Seems like adding virtual-packages.yml was key, although maybe it's just a matter of new releases with more compatible dependencies. In any case, this also reduces the ML image size by ~700MB !

tensorflow: 11.3 GB -> 10.7 GB
pytorch: 13.7 GB -> 13 GB

I'm going to go ahead and merge.

@scottyhq scottyhq merged commit 6d4c2ab into master May 21, 2024
5 checks passed
@scottyhq scottyhq deleted the cuda-12.0 branch May 21, 2024 15:30
@weiji14
Member Author

weiji14 commented May 21, 2024

Yeah, I want to say that the virtual-packages.yml was the thing that got it to work, but I'm still confused as to why conda-lock's --with-cuda flag didn't work. It's hard to create a minimal reproducible example to report upstream, and I'm not sure which packages are nudging the solver to an 'incorrect' solution with cpu packages, so I'll just leave it at that for now.

In any case, this also reduces the ML image size by ~700MB !

Yes, this is because conda-forge has split the cudatoolkit package into separate sub-components for CUDA 12 (see conda-forge/conda-forge.github.io#1963) like libcublas-dev, libcusparse-dev, etc., so the images should be much more lightweight now.

Development

Successfully merging this pull request may close these issues.

Flax needs to be upgraded in the tensorflow/jax image
3 participants