CI setup

The CI is set up with GitHub Actions using an on-demand EC2 backend.

This setup currently uses a 4-GPU p3.8xlarge instance, which is enough to test tp=2, pp=2.

Unfortunately, this only works for PRs created from non-forked branches.

The workflow file

The workflow file is at .github/workflows/main.yml

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Start EC2 runner
        id: start-ec2-runner
        uses: machulav/ec2-github-runner@v2
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-0dfaabfa78a779fbc
          ec2-instance-type: p3.8xlarge
          subnet-id: subnet-3502b45e
          security-group-id: sg-e8f46d9d

It was later updated to be fault-tolerant: it tries to start the EC2 instance in 3 different sub-regions, to cope with situations where EC2 reports that it doesn't have the resources to start the desired instance.
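The fallback idea can be sketched in a few lines of Python. This is only an illustration of the retry logic, not the code used in the workflow: `start_with_fallback`, `fake_start` and the `subnet-a`/`subnet-b`/`subnet-c` names are hypothetical, and the real launch call is replaced by a stand-in that raises when a location has no capacity.

```python
from typing import Callable, Iterable, Optional


def start_with_fallback(
    subnet_ids: Iterable[str],
    start_instance: Callable[[str], str],
) -> Optional[str]:
    """Try each location in order and return the first instance id that starts."""
    for subnet_id in subnet_ids:
        try:
            return start_instance(subnet_id)
        except RuntimeError as err:
            # Treated here as "no capacity in this location": move on to the next one.
            print(f"{subnet_id}: {err}")
    return None


def fake_start(subnet_id: str) -> str:
    # Hypothetical stand-in for the real launch call:
    # only the third location has capacity.
    if subnet_id != "subnet-c":
        raise RuntimeError("insufficient capacity")
    return f"i-started-in-{subnet_id}"


result = start_with_fallback(["subnet-a", "subnet-b", "subnet-c"], fake_start)
print(result)
```

If every location fails, the function returns `None`, which the caller can treat as a failed CI run.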

Connect to instance

To pre-install things, connect to the instance manually and install what's desired:

  1. choose and start an EC2 instance
  2. connect to it as ubuntu, then run sudo su, since the runner runs as root (I couldn't find a way around this):
ssh -l ubuntu -i "~/.ssh/bigscience-aim.pem" ubuntu@ec2-3-14-127-35.us-east-2.compute.amazonaws.com

Once installed, stop the instance.

Then create a new AMI (see below) and update the script using the new AMI.

Prepare the machine

Steps used to set up the fixed software (it won't be installed at test time):

install fixed packages

  • torch 1.9.0/cu-11.1
pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
  • all kinds of prerequisites
pip install transformers
wget https://raw.githubusercontent.com/microsoft/DeepSpeed/master/requirements/requirements.txt -O requirements-ds.txt
pip install -r requirements-ds.txt
wget https://raw.githubusercontent.com/bigscience-workshop/Megatron-DeepSpeed/main/requirements.txt -O requirements-ms.txt
pip install -r requirements-ms.txt

  • apex - needs a hack to deal with mismatching minor cuda versions (and it takes forever to build), so using this patch:

XXX: this no longer works - had to manually patch pytorch to avoid mismatch failure

--- a/setup.py
+++ b/setup.py
@@ -99,6 +99,7 @@ def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
     print(raw_output + "from " + cuda_dir + "/bin\n")

     if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
+        return
         raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
                            "not match the version used to compile Pytorch binaries.  " +
                            "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda) +

install it (the repo was cloned with git clone https://github.com/NVIDIA/apex):

cd code/apex
# I copied this script from my setup
./build.sh

make a new AMI image

Once everything needed is installed (and every time anything new is installed), a new AMI must be created (this is like an .iso image snapshot).

  1. go to https://us-east-1.console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:
  2. choose the instance to create a new image from
  3. Actions -> Image and Templates -> Create Image

Make sure it's created in the correct region (the same as in the script), or copy it to the right region afterwards.

The process of creating the image can be done while the instance that has been updated is still running.

Just don't forget to turn the instance off once you've validated that it works.

Finally, once the image is created, the script needs to be updated to the new AMI id (key ec2-image-id) in .github/workflows/main.yml
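Updating the key by hand is easy to get wrong, so here is a minimal sketch of doing it with a script. It assumes the key appears as a plain `ec2-image-id: <value>` line, as in the workflow excerpt above. `set_ami_id` is a hypothetical helper, and `ami-00000000000000000` is a placeholder, not a real image id.

```python
import re


def set_ami_id(workflow_text: str, new_ami_id: str) -> str:
    """Return workflow_text with every ec2-image-id value replaced by new_ami_id."""
    pattern = re.compile(r"^([ \t]*ec2-image-id:[ \t]*)\S+", flags=re.MULTILINE)
    updated, count = pattern.subn(lambda m: m.group(1) + new_ami_id, workflow_text)
    if count == 0:
        # Fail loudly rather than silently leaving the old image in place.
        raise ValueError("no ec2-image-id key found")
    return updated


sample = (
    "          mode: start\n"
    "          ec2-image-id: ami-0dfaabfa78a779fbc\n"
    "          ec2-instance-type: p3.8xlarge\n"
)
updated_sample = set_ami_id(sample, "ami-00000000000000000")  # placeholder id
print(updated_sample)
```

To apply it to the real file, read the workflow file into a string, pass it through `set_ami_id`, and write the result back. Every occurrence of the key is replaced, which matters if the workflow starts the runner from more than one step.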

Stop instance alarm

It looks like occasionally the instance doesn't stop and continues running.

I added a stop alarm to automatically kill the instance after 1h if util < 10% following the exact instructions from: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/UsingAlarmActions.html
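The alarm above was created by following the linked guide. For reference, an equivalent alarm could probably also be created programmatically. The sketch below only builds the arguments for CloudWatch's `put_metric_alarm` call. It rests on several assumptions: that "util" means average CPU utilization, that the desired action is to stop (not terminate) the instance, and that 12 five-minute periods are used to cover the hour. The alarm name and instance id are hypothetical.

```python
def stop_alarm_params(instance_id: str, region: str) -> dict:
    """Build keyword arguments for CloudWatch's put_metric_alarm."""
    return {
        "AlarmName": f"stop-idle-{instance_id}",  # hypothetical name
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",  # assumption: "util" is CPU utilization
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,  # 5-minute samples
        "EvaluationPeriods": 12,  # 12 x 5 min = 1 hour
        "Threshold": 10.0,  # percent
        "ComparisonOperator": "LessThanThreshold",
        # Built-in EC2 stop action for the given region.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:stop"],
    }


params = stop_alarm_params("i-0123456789abcdef0", "us-east-1")  # hypothetical id
print(params)

# With boto3 installed and credentials configured, this could be applied as:
#   import boto3
#   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)
```

Keeping the parameters in a plain function makes them easy to review before anything is sent to AWS.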

Guides

Set up guide: https://github.com/machulav/ec2-github-runner

Launching an EC2 instance: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html?icmpid=docs_ec2_console

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html