
A quick guide on how to add and cache dependencies on PyTorch CI

Context

With the high volume of pull requests to PyTorch and the extensive scope of the PyTorch test suite, PyTorch CI runs thousands of build and test jobs daily across multiple platforms including Linux (CPU, CUDA, ROCm), Windows (CPU, CUDA), MacOS (x86_64, M1), Android, and iOS. This makes stability the most important aspect of the CI. Stability covers many areas and means many things, so this wiki focuses only on CI dependencies. Specifically, it is about how to add and cache dependencies safely and reliably.

Most software is built on top of other software and requires its dependencies to be in place to work. PyTorch CI is no different. Over the years, the most common issues with setting up dependencies have been:

  • A dependency is not pinned. When it's updated upstream, it can unexpectedly break things.
  • A dependency is not cached and is set up from scratch every time a job that requires it runs. This usually means downloading and installing something, e.g. a Conda package, from somewhere, i.e. the Internet. Do this enough times and it's bound to fail flakily.

Like a science experiment, the solution is 1) to run the CI jobs in a controlled manner with fixed parameters and reproducible results and 2) to have all the dependencies close at hand. This is achieved by:

  1. Pin all CI dependencies to specific versions and update them explicitly.
  2. Put all Linux dependencies into Docker images.
  3. Put all Windows dependencies into the Windows AMI used to launch Windows runners.
  4. Cache all MacOS dependencies with GitHub cache.

Adding a new CI dependency most likely needs to be done across all major platforms: Linux, Windows, and MacOS. So, let's go over the details for each platform.

Linux dependencies

Common Linux CI pip dependencies are specified in Docker requirements-ci.txt and are installed for the selected Python version in install_conda.sh. Pinning the version is recommended, e.g. lintrunner==0.10.7, unless there is a compelling reason not to do so. Adding a dependency here will make it available to all Linux build and test jobs (easy peasy).
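For example, adding a pinned dependency is a one-line change in requirements-ci.txt (lintrunner is the real pin mentioned above; mypackage is a hypothetical name used purely for illustration):

# Pin exact versions so upstream releases can't break CI
lintrunner==0.10.7
mypackage==1.2.3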

For specific cases, it's also OK to install conda and pip dependencies as part of the Docker build script. For example, ONNX dependencies are installed in install_onnx.sh and are available only in the pytorch-linux-focal-py3-clang10-onnx Docker image. The list of Docker images and what they include can be found in Docker build.sh. #96590 is a good example of how this can be done. The process is roughly as follows:

  1. Prepare the list of required dependencies, e.g. install_onnx.sh
  2. Identify the Dockerfile(s) to update. There are currently three flavors:
    1. Ubuntu, for generic Linux CPU jobs
    2. Ubuntu with CUDA, for Linux CUDA jobs of course
    3. Ubuntu with ROCm, for Linux ROCm jobs
  3. Add the script to the Dockerfile(s) to be built as a new Docker layer, for example:
ARG ONNX
# Install ONNX dependencies
COPY ./common/install_onnx.sh ./common/common_utils.sh ./
RUN if [ -n "${ONNX}" ]; then bash ./install_onnx.sh; fi
RUN rm install_onnx.sh common_utils.sh
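To see how the conditional layer behaves, here is a minimal local build sketch (the image tag is hypothetical; real CI images are built through build.sh). Passing a non-empty ONNX build arg makes the RUN step execute install_onnx.sh; leaving it unset skips the install entirely:

# Hypothetical local build; any non-empty ONNX value triggers the install layer
docker build --build-arg ONNX=yes -t pytorch-ci-onnx:dev .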

Going beyond conda and pip, we can also put frequently used ML models into the Docker image to avoid hitting external hosts like https://huggingface.co excessively. To go a bit into the details here, we host the Docker images used by the CI on AWS ECR, where the majority of our EC2 runners reside. So getting Docker images with cached models from ECR is cheaper, faster, and more reliable than getting the models from external sources. Fortunately, it's relatively easy to cache all these models by just downloading them once when building the image, e.g. #96793. They will then be readily available in the runner's cache directory for later use.
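As a sketch of what such a build-time caching step can look like inside a Docker install script (the package version and model name below are placeholders, not the actual list from #96793):

#!/bin/bash
# Fetch the model once at image build time so it lands in the image's
# Hugging Face cache directory (~/.cache/huggingface by default) instead of
# being downloaded from huggingface.co on every CI run.
set -ex
pip install transformers==4.26.1
python -c "from transformers import AutoModel; AutoModel.from_pretrained('bert-base-uncased')"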

Windows dependencies

At the moment, there is no Docker support on Windows, so preparing Windows dependencies is still a cumbersome process. Please reach out to the PyTorch Dev Infra team if you need support during the process.

  1. The Windows AMI definition is in the test-infra repo and is written using Packer. So, the first step is to install Packer on your dev box.
  2. Get familiar with PowerShell on Windows. You can also take a look at some existing scripts that we use.
  3. Follow the instructions to build and publish a new AMI by running packer build -var 'skip_create_ami=false' . (note the trailing dot). There are some caveats:
    1. Only members of the PyTorch Dev Infra team with access to our AWS account can publish a new AMI
    2. There is only one Windows AMI shared between CPU and CUDA jobs. This is a known pain point that has yet to be addressed.
  4. Depending on the type of dependency, there are:
    1. Install-Conda-Dependencies.ps1 to install all Conda dependencies (see the sketch after this list). Again, pin the version whenever you can. Questions will be asked otherwise.
    2. Install-Pip-Dependencies.ps1 to install all pip dependencies.
    3. Other dependencies are covered in their own PowerShell scripts, e.g. Windows CUDA. Create a new script if you need to set up a new dependency.
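For example, a pinned Conda install inside one of these scripts might boil down to a one-liner like the following (a sketch with hypothetical packages, not the actual contents of Install-Conda-Dependencies.ps1):

# Hypothetical excerpt: install pinned Conda packages non-interactively
conda install -y numpy=1.24.2 pyyaml=6.0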

Once the build succeeds and a new AMI is published, it's time to test it before rolling it out to production. This is done on pytorch-canary:

  1. Update the Windows AMI used by pytorch-canary here
  2. Once merged, submit a pull request to pytorch-canary, e.g. #158, to run Windows build and test jobs.
    1. The first thing is to confirm that the new AMI is used. This information can be found in the Setup Windows step, where the AMI ID is shown.
    2. Ensure that all Windows jobs, including binary ones, pass.
  3. Finally, update the Windows AMI used by pytorch here

MacOS dependencies

Neither Docker nor AMI options are available for MacOS at the moment, so its dependencies can't be set up beforehand and are still downloaded when the CI jobs run. Nevertheless, we use GitHub cache to cache all Conda and pip dependencies to minimize flakiness. Adding a new dependency is then straightforward: put it into one of the following files:

  1. Pinned Conda dependencies are in .github/requirements/conda-env-OS-ARCH.txt environment files. They are:
    1. conda-env-macOS-ARM64.txt (MacOS M1)
    2. conda-env-macOS-X64.txt (MacOS x86_64)
    3. conda-env-iOS.txt (iOS)
  2. Pinned pip dependencies are in .github/requirements/pip-requirements-OS.txt requirements files. They are:
    1. pip-requirements-macOS.txt (MacOS M1 and x86_64)
    2. pip-requirements-iOS.txt (iOS)
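Entries in these files are plain pinned package specs, one per line. Note that Conda pins use a single = while pip pins use ==. For example (the versions below are hypothetical, for illustration only):

# Hypothetical excerpt of conda-env-macOS-ARM64.txt
numpy=1.24.2
pyyaml=6.0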

Similarly, a new Conda env or pip requirements file can be added for other platforms. The recommended way is to pass the file to setup-miniconda and let the action download, install, and cache all the dependencies automatically, e.g. in the _mac-test.yml workflow:

- name: Setup miniconda (arm64, py3.9)
  if: ${{ runner.arch == 'ARM64' }}
  uses: pytorch/test-infra/.github/actions/setup-miniconda@main
  with:
    python-version: 3.9
    environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}
    pip-requirements-file: .github/requirements/pip-requirements-${{ runner.os }}.txt