Add a docker container (and docker-compose file) to run the model/notebooks in a containerized environment #166

Open · wants to merge 22 commits into main from feature/dockerized-environment

Conversation

leothomas (Member) commented:

To make it easier to run the model/notebooks without having to manage dependency installation across various machines/environments, I've added a micromamba-based Dockerfile that creates the conda environment with the specified libraries.
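For a rough idea of the shape of it, a minimal micromamba-based Dockerfile looks something like the sketch below (the image tag, paths, and CMD here are illustrative placeholders, not necessarily the exact contents of the file in this PR):

# Base image providing micromamba and a non-root $MAMBA_USER (tag is a placeholder)
FROM mambaorg/micromamba:1.5.6
WORKDIR /model
# Copy the lock file and build the conda environment from it
COPY --chown=$MAMBA_USER:$MAMBA_USER conda-lock.yml /model/conda-lock.yml
RUN micromamba create -y -n claymodel --file conda-lock.yml && \
    micromamba clean --all --yes
EXPOSE 8888
# The micromamba entrypoint activates the env named by ENV_NAME before running CMD
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser"]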

I've also added a Docker Compose file to specify the build-time and run-time arguments for exposing the JupyterLab port and mounting the current directory as a volume in the container. This allows users to modify any of the model/notebook code locally without having to rebuild the image.
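For example, the compose service might look roughly like this (the port, mount path, and environment values are illustrative, not necessarily the exact file in this PR):

services:
  claymodel:
    build: .
    platform: linux/amd64
    ports:
      - "8888:8888"        # expose JupyterLab
    volumes:
      - .:/model           # mount the repo so local edits don't require a rebuild
    environment:
      - ENV_NAME=claymodel # env for the micromamba entrypoint to activate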

By default the Docker image starts JupyterLab, but this can be overridden in the docker-compose file or on the command line with any other Python or bash command.

The platform=linux/amd64 build-time and run-time args allow the image to be built and run on Apple Silicon (M1) Macs while maintaining compatibility with Linux machines.

The container can be run with:

docker-compose up

or

docker-compose run claymodel <command>

where <command> overrides the default JupyterLab startup command.

The container can also be built directly (bypassing the need for docker-compose) with:

docker build . -t clay --platform linux/amd64

and then run with:

docker run --rm -it -v $(pwd):/model -p 8888:8888 -e ENV_NAME=claymodel --platform linux/amd64 clay:latest

weiji14 (Contributor) left a comment:

Thanks @leothomas! Just some small comments for now. Do you think we should also add a .dockerignore file to keep the Docker image a pure virtual environment? Also, what are your thoughts on setting up some CI to push pre-built containers to a docker registry (can be done in a separate PR)?
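For the registry idea, something along these lines could work (the registry, tags, and action versions below are just placeholders):

name: Build and push Docker image
on:
  push:
    branches: [main]
jobs:
  docker:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          context: .
          platforms: linux/amd64
          push: true
          tags: ghcr.io/${{ github.repository }}:latest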

environment.yml (outdated), comment on lines 17 to 18:
- pytorch~=2.1.0
- pyarrow~=15.0.0
Contributor:

Could you resolve the merge conflict here with the main branch? Also see if removing the pyarrow pin works. I managed to get conda to solve for osx-arm64 recently on #164 after https://github.com/conda-forge/torchvision-feedstock/pull/89/files was merged.

Collaborator:

Agreed. You should be able to simply ditch your changes to environment.yml. If not, then keep in mind that whenever you modify environment.yml, you must regenerate conda-lock.yml, so you should never be committing changes to only environment.yml.
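For reference, regenerating it is roughly the following (the platform list here is an assumption; use whatever platforms this repo actually targets):

conda-lock lock --file environment.yml -p linux-64 -p osx-arm64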

leothomas (Member, author):

Did you mean conflicts with ci/osx-arm64, rather than conflicts with main? I didn't notice any conflicts with main.

I merged the changes from ci/osx-arm64 and, while I'm able to install from conda-lock.yml (as @chuckwondo suggested) I'm not able to install from environment.yml due to cuda being unavailable in the docker container:

> [claymodel 4/4] RUN micromamba create -y -n claymodel --file environment.yml &&     micromamba clean --all --yes:
135.1 error    libmamba Could not solve for environment specs
135.1     The following package could not be installed
135.1     └─ pytorch ~=2.1.0 *cuda12* is not installable because it requires
135.1        └─ __cuda, which is missing on the system.
135.3 critical libmamba Could not solve for environment specs
------
failed to solve: process "/usr/local/bin/_dockerfile_shell.sh micromamba create -y -n claymodel --file environment.yml &&     micromamba clean --all --yes" did not complete successfully: exit code: 1

I'm thinking that the conda-lock.yml and environment.yml files aren't synchronized?

Let me know what the best course of action is. I can look into making a composite Docker image, based on both micromamba and nvidia/cuda, so that we can install pytorch with CUDA support in the container.
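Roughly what I have in mind (a completely untested sketch; the CUDA base tag and install paths are placeholders):

FROM nvidia/cuda:12.0.1-runtime-ubuntu22.04
# Add the micromamba binary to the CUDA base image (download snippet from the mamba docs)
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl bzip2 ca-certificates && \
    curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | \
    tar -xj -C /usr/local bin/micromamba && \
    rm -rf /var/lib/apt/lists/*
ENV MAMBA_ROOT_PREFIX=/opt/conda
COPY conda-lock.yml /tmp/conda-lock.yml
RUN micromamba create -y -n claymodel --file /tmp/conda-lock.yml && \
    micromamba clean --all --yes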

Contributor:

> Did you mean conflicts with ci/osx-arm64, rather than conflicts with main? I didn't notice any conflicts with main.

I meant with main actually. The changes in the ci/osx-arm64 branch aren't much.

> I merged the changes from ci/osx-arm64 and, while I'm able to install from conda-lock.yml (as @chuckwondo suggested) I'm not able to install from environment.yml due to cuda being unavailable in the docker container:

Still, best to go with conda-lock.yml as Chuck suggested. If you want to install from environment.yml on a device without CUDA GPUs, set CONDA_OVERRIDE_CUDA=12.0 following https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-virtual.html#overriding-detected-packages.
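i.e. something like this (an illustrative command, not necessarily the exact one for this repo):

CONDA_OVERRIDE_CUDA="12.0" micromamba create -y -n claymodel --file environment.yml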

leothomas (Member, author):

Hm, that's odd - there were no merge conflicts with main in that case. I might have been missing something. Do the environment.yml and conda-lock.yml files look correct in their current state?

Dockerfile (outdated review comment, resolved)
docker-compose.yml (outdated review comment, resolved)
weiji14 added the maintenance label (Boring but important stuff for the core devs) on Feb 27, 2024
chuckwondo (Collaborator) left a comment:

In addition to the individual comments/suggestions, please also add a section to README.md about how to run the Docker container (docker compose up) as an alternative to installing things locally, and how to access JupyterLab once the container is started. In particular, mention the URLs that appear in the logging messages:

claymodel-1  |     Or copy and paste one of these URLs:
claymodel-1  |         http://ffd23ea64b9b:8888/lab?token=abebb7b9476b8fff1fb8b543cc552e8c9641b5b38547c3cb
claymodel-1  |         http://127.0.0.1:8888/lab?token=abebb7b9476b8fff1fb8b543cc552e8c9641b5b38547c3cb
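A possible shape for that README section (the wording here is only a suggestion):

## Running with Docker

As an alternative to installing the dependencies locally, build and start the container with:

    docker compose up

Once JupyterLab has started, open the http://127.0.0.1:8888/lab?token=... URL printed in the container logs.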

Dockerfile (outdated review comment, resolved)
environment.yml (outdated review comment, resolved)
Dockerfile (outdated review comment, resolved)
Dockerfile (outdated review comment, resolved)

chuckwondo (Collaborator) commented:

> Thanks @leothomas! Just some small comments for now. Do you think we should also add a .dockerignore file to keep the Docker image a pure virtual environment? Also, what are your thoughts on setting up some CI to push pre-built containers to a docker registry (can be done in a separate PR)?

Absolutely add a .dockerignore file to this PR. You can probably start with much (or all) of what's in .gitignore.

However, keep in mind that there are cases where you must be more explicit in .dockerignore.

For example, you cannot use only __pycache__/ in .dockerignore because that will ignore only a top level __pycache__/ directory. To ignore such a directory at all levels, you must use **/__pycache__/ in .dockerignore. Another good candidate for this is to add **/.ipynb_checkpoints/ since there are notebooks in the docs directory.
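A deny-list version along those lines might start out like this (entries are illustrative, plus whatever else .gitignore suggests):

.git/
**/__pycache__/
**/.ipynb_checkpoints/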

Alternatively, as I mentioned to you ages ago, I tend to "invert" my use of .dockerignore to make it act more like an "allow" list rather than a "deny" list, which I find to be safer and clearer. For example, consider making a .dockerignore that ignores everything and then does not ignore the things you want to "allow":

*
!environment.yml
!conda-lock.yml
!**/*.py
!**/*.ipynb
!**/*.sh

leothomas (Member, author) commented:

Awesome! Thank you both! I've addressed some of the requested changes:

  • Regenerating the conda-lock.yml
  • Installing the micromamba environment from conda-lock.yml rather than environment.yml
  • Removing the pyarrow dependency
  • Adding a .dockerignore which ignores everything by default and only allows the files specifically needed
  • Updating the README with relevant info on the docker image/container and how to run it
  • Updating the JupyterLab flags to avoid the unnecessary use of --allow-root and to address some warning/debug logs

Do y'all think it would be valuable for me to try getting this to run with a CUDA Docker image, to enable the pytorch-cuda installation?

README.md (outdated review comment, resolved)
docs/partial-inputs-flood-tutorial.ipynb (outdated review comment, resolved)
conda-lock.yml (outdated review comment, resolved)
weiji14 mentioned this pull request on Mar 5, 2024
weiji14 (Contributor) left a comment:

Some suggested changes to reduce the diff from the merge conflict handling.

docs/partial-inputs-flood-tutorial.ipynb (outdated review comment, resolved)
docs/partial-inputs-flood-tutorial.ipynb (outdated review comment, resolved)
docs/partial-inputs.ipynb (outdated review comment, resolved)
docs/partial-inputs.ipynb (outdated review comment, resolved)
environment.yml (outdated review comment, resolved)
src/callbacks_wandb.py (outdated review comment, resolved)
leothomas and others added 8 commits on March 14, 2024 at 17:31 (several co-authored by Wei Ji <23487320+weiji14@users.noreply.github.com>), including the merge of …dation/model into feature/dockerized-environment

Dockerfile, comment on these lines:

COPY --chown=$MAMBA_USER:$MAMBA_USER . .

RUN micromamba create -y -n claymodel --file conda-lock.yml && \
Contributor:

Hmm, I'm getting this error when running docker build . -t clay --platform linux/amd64 locally:

0.384 Transaction starting
79.51 critical libmamba Failed to create dir 'info'
79.51 error    libmamba Error opening for reading "/opt/conda/pkgs/cudnn-8.9.7.29-h092f7fd_3/info/index.json": No such file or directory
79.51 error    libmamba Error when extracting package: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
79.51 cudnn-8.9.7.29-h092f7fd_3.conda extraction failed
79.73 critical libmamba Found incorrect download: cudnn. Aborting

Not sure if it's because something changed when I re-locked the conda-lock.yml file manually when doing the merge from main at 5be9fa2#diff-63113c19c5d310b8e350b302f279e2297ba6faa9ac9c99c1c82e83e508447865

leothomas (Member, author):

Oh jeez - I'm getting a segmentation fault:

2.612 Transaction starting
1441.7 qemu: uncaught target signal 11 (Segmentation fault) - core dumped
1441.7 bash: line 1:    10 Segmentation fault      micromamba create -y -n claymodel --file conda-lock.yml
------
Dockerfile:11
--------------------
  10 |
  11 | >>> RUN micromamba create -y -n claymodel --file conda-lock.yml && \
  12 | >>>     micromamba clean --all --yes
  13 |
--------------------
ERROR: failed to solve: process "/usr/local/bin/_dockerfile_shell.sh micromamba create -y -n claymodel --file conda-lock.yml &&     micromamba clean --all --yes" did not complete successfully: exit code: 139

Wondering if it's related either to a lack of available memory or to installing CUDA in Docker.

I recall that when we installed the model libraries locally we had to install pytorch without CUDA - could the CUDA installation have made its way into the conda-lock file and be causing issues?
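A quick way to check (plain grep, nothing repo-specific):

grep -i -E 'cuda|cudnn' conda-lock.yml | head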

Labels: maintenance (Boring but important stuff for the core devs)
Projects: none yet
Linked issues: none yet
3 participants