Update base for Update on "[NVFuser] Upstream push 0907"
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:

- codegen improvements:
i. improved view support in the pointwise and transpose schedulers
ii. grouped grid Welford added for better outer-norm grid persistence in normalization

- misc:
i. new composite ops added: variance_mean, arange (see the sketch below)
ii. fixed a misaligned-address error in the transpose scheduler
iii. separated the compilation API from the execution API to prepare for async compilation
iv. double-type support in the expression evaluator
v. refactored PYTORCH_NVFUSER_DUMP to support saving PTX and CUBIN
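
For context, a minimal sketch of how the new composite ops surface through the TorchScript nvfuser path. The enablement call, shapes, and dump-option names are assumptions for illustration, not code from this PR:

```python
import torch

# Assumed TorchScript-era switch for enabling nvfuser; not part of this PR.
torch._C._jit_set_nvfuser_enabled(True)

@torch.jit.script
def fused(x: torch.Tensor) -> torch.Tensor:
    # variance_mean corresponds to torch.var_mean; arange is fusible as well.
    var, mean = torch.var_mean(x, dim=-1, keepdim=True)
    idx = torch.arange(x.size(-1), device=x.device, dtype=x.dtype)
    return (x - mean) / torch.sqrt(var + 1e-5) + idx

x = torch.randn(8, 1024, device="cuda")
for _ in range(3):  # warm-up iterations let the fusion compile and kick in
    out = fused(x)

# Per item v, running with e.g. PYTORCH_NVFUSER_DUMP=ptx,cubin (option names
# assumed) should save the generated PTX/CUBIN artifacts.
```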

Commits in this PR from the devel branch:
```
89330aa Tensor factories must set the output shape as its input (#1939)
b2fd01e arange support (#1933)
56c00fd Double support on all expression evaluators (#1937)
371f282 Improve trivial reduction merge support (#1931)
1d0c267 Test `rand` in a fusion with zero tensor input (#1932)
0dab160 Fix softmax bwd sizes. (#1890)
ef98f36 Fix a bug (#1936)
63132a0 Propagate permissive mapping information into indexing pass (#1929)
b4ac2c8 Map IterationDomains through view operations. (#1919)
c0a187a do not use deprecated functions (#1935)
88de85e Upstream cherry pick fixes 0811 (#1934)
b247dcf Separate kernel compilation API from kernel execution API (#1914)
b34e3b9 Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)
14a53e6 Nullary RNGOp (#1892)
3c3c89e Misc fixes/tuning for transpose scheduler (#1912)
20cf109 Grouped grid welford (#1921)
6cf7eb0 Transpose scheduler small dim sizes better support (#1910)
9341ea9 Disabled ViewPersistentShmoo sizes that results in NAN (#1922)
057237f Fix CUDA driver error: misaligned address for transpose scheduler  (#1918)
3fb3d80 Add variance_mean function using Welford (#1907)
98febf6 Remove DisableOption::UnrollWithRng (#1913)
ee8ef33 Minor fix for the debug interface of using PTX directly (#1917)
6e8f953 Add PYTORCH_NVFUSER_DUMP options to save PTX and CUBIN (#1916)
5eefa9a dopt is only available since nvrtc 11.7 (#1915)
2ec8fc7 Kill computeAtBetween (#1911)
d0d106a Improve view support on pointwise and transpose scheduler (#1906)
e71e1ec Fix name clash of RNG with shared memory (#1904)
3381793 Fix mutator and sameAs for expanded IterDomain (#1902)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)

[ghstack-poisoned]
jjsjann123 committed Sep 19, 2022 · 2 parents 6c113ba + 9024015 · commit bfb7b15
Showing 683 changed files with 23,353 additions and 17,635 deletions.
2 changes: 1 addition & 1 deletion .circleci/docker/build.sh
```diff
@@ -379,7 +379,7 @@ docker build \
   --build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \
   --build-arg "KATEX=${KATEX:-}" \
   --build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \
-  --build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx900;gfx906}" \
+  --build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx906}" \
   --build-arg "IMAGE_NAME=${IMAGE_NAME}" \
   --build-arg "UCX_COMMIT=${UCX_COMMIT}" \
   --build-arg "UCC_COMMIT=${UCC_COMMIT}" \
```
8 changes: 7 additions & 1 deletion .circleci/docker/common/install_cudnn.sh
```diff
@@ -4,7 +4,13 @@ if [[ ${CUDNN_VERSION} == 8 ]]; then
     # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
     mkdir tmp_cudnn && cd tmp_cudnn
     CUDNN_NAME="cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive"
-    curl -OLs https://developer.download.nvidia.com/compute/redist/cudnn/v8.3.2/local_installers/11.5/${CUDNN_NAME}.tar.xz
+    if [[ ${CUDA_VERSION:0:4} == "11.7" ]]; then
+        CUDNN_NAME="cudnn-linux-x86_64-8.5.0.96_cuda11-archive"
+        curl -OLs https://ossci-linux.s3.amazonaws.com/${CUDNN_NAME}.tar.xz
+    else
+        curl -OLs https://developer.download.nvidia.com/compute/redist/cudnn/v8.3.2/local_installers/11.5/${CUDNN_NAME}.tar.xz
+    fi
+
     tar xf ${CUDNN_NAME}.tar.xz
     cp -a ${CUDNN_NAME}/include/* /usr/include/
     cp -a ${CUDNN_NAME}/include/* /usr/local/cuda/include/
```
2 changes: 1 addition & 1 deletion .circleci/docker/common/install_ucc.sh
```diff
@@ -36,7 +36,7 @@ function install_ucc() {
   git submodule update --init --recursive

   ./autogen.sh
-  ./configure --prefix=$UCC_HOME --with-ucx=$UCX_HOME --with-nccl=no --with-cuda=$with_cuda
+  ./configure --prefix=$UCC_HOME --with-ucx=$UCX_HOME --with-cuda=$with_cuda
   time make -j
   sudo make install

```
1 change: 1 addition & 0 deletions .circleci/docker/ubuntu-cuda/Dockerfile
```diff
@@ -118,6 +118,7 @@ COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm

 # Install CUDNN
 ARG CUDNN_VERSION
+ARG CUDA_VERSION
 COPY ./common/install_cudnn.sh install_cudnn.sh
 RUN if [ "${CUDNN_VERSION}" -eq 8 ]; then bash install_cudnn.sh; fi
 RUN rm install_cudnn.sh
```
2 changes: 1 addition & 1 deletion .circleci/scripts/windows_cudnn_install.sh
```diff
@@ -18,7 +18,7 @@ case ${CUDA_VERSION} in
         ;;
     11.7)
         # Use cudnn8.3 with hard-coded cuda11.5 version
-        cudnn_file_name="cudnn-windows-x86_64-8.3.2.44_cuda11.5-archive"
+        cudnn_file_name="cudnn-windows-x86_64-8.5.0.96_cuda11-archive"
         ;;
     *)
         echo "CUDA_VERSION: ${CUDA_VERSION} not supported yet"
```
2 changes: 1 addition & 1 deletion .github/ci_commit_pins/torchdynamo.txt
```diff
@@ -1 +1 @@
-fe3173f7e6c804e6330ac187ea8e4101f45ff9a2
+41c44bc1d080d6cf063419a4166732b983b84eef
```
2 changes: 1 addition & 1 deletion .github/ci_commit_pins/vision.txt
```diff
@@ -1 +1 @@
-84dcf695d64c15f8a0be845ac65901bdde845429
+a4f53308b2d0f1aa9191686e326f45c26053f686
```
2 changes: 1 addition & 1 deletion .github/ci_commit_pins/xla.txt
```diff
@@ -1 +1 @@
-b8688ee3c03120a15978844db6c4fa73eceb6594
+4dec902617aea14ca4013e402eea56e92701cac9
```
4 changes: 4 additions & 0 deletions .github/merge_rules.yaml
```diff
@@ -3,6 +3,7 @@
   - .jenkins/caffe2/*
   - aten/src/ATen/core/interned_strings.h
   - docs/source/onnx.rst
+  - docs/source/onnx*
   - docs/source/scripts/onnx/**
   - scripts/onnx/**
   - test/jit/test_export_modes.py
@@ -15,6 +16,8 @@
   - torch/csrc/jit/serialization/onnx.*
   - torch/csrc/onnx/**
   - torch/onnx/**
+  - third_party/onnx
+  - caffe2/python/onnx/**
   approved_by:
   - BowenBao
   - abock
@@ -323,6 +326,7 @@
   - '*'
   approved_by:
   - pytorch/metamates
+  - mruberry
   mandatory_checks_name:
   - Facebook CLA Check
   - Lint
```
69 changes: 0 additions & 69 deletions .github/scale-config.yml

This file was deleted.

2 changes: 1 addition & 1 deletion .github/scripts/generate_binary_build_matrix.py
```diff
@@ -13,7 +13,7 @@
 from typing import Dict, List, Tuple, Optional


-CUDA_ARCHES = ["10.2", "11.3", "11.6", "11.7"]
+CUDA_ARCHES = ["10.2", "11.6", "11.7"]


 ROCM_ARCHES = ["5.1.1", "5.2"]
```
9 changes: 0 additions & 9 deletions .github/scripts/generate_ci_workflows.py
```diff
@@ -207,15 +207,6 @@ class OperatingSystem:
     ),
 ]
 WINDOWS_BINARY_SMOKE_WORKFLOWS = [
-    BinaryBuildWorkflow(
-        os=OperatingSystem.WINDOWS,
-        package_type="wheel",
-        build_configs=generate_binary_build_matrix.generate_wheels_matrix(
-            OperatingSystem.WINDOWS,
-            arches=["11.3"],
-            python_versions=["3.7"]),
-        branches="master",
-    ),
     BinaryBuildWorkflow(
         os=OperatingSystem.WINDOWS,
         package_type="libtorch",
```
40 changes: 39 additions & 1 deletion .github/scripts/run_torchbench.py
```diff
@@ -13,10 +13,12 @@
 # 1. Does not reuse the build artifact in other CI workflows
 # 2. CI jobs are serialized because there is only one worker
 import os
+import boto3  # type: ignore[import]
 import git  # type: ignore[import]
 import pathlib
 import argparse
 import subprocess
+from pathlib import Path

 from typing import List, Tuple

@@ -31,6 +33,25 @@
 direction: decrease
 timeout: 720
 tests:"""
+S3_BUCKET = "ossci-metrics"
+S3_PREFIX = "torchbench-pr-test"
+S3_URL_BASE = f"https://{S3_BUCKET}.s3.amazonaws.com/"
+
+class S3Client:
+    def __init__(self, bucket: str = S3_BUCKET, prefix: str = S3_PREFIX):
+        self.s3 = boto3.client('s3')
+        self.resource = boto3.resource('s3')
+        self.bucket = bucket
+        self.prefix = prefix
+
+    def upload_file(self, file_path: Path, filekey_prefix: str) -> None:
+        assert file_path.is_file(), f"Specified file path {file_path} does not exist or not file."
+        file_name = file_path.name
+        s3_key = f"{self.prefix}/{filekey_prefix}/{file_name}"
+        print(f"Uploading file {file_name} to S3 with key: {s3_key}")
+        self.s3.upload_file(str(file_path), self.bucket, s3_key)
+        # output the result URL
+        print(f"Uploaded the result file {file_name} to {S3_URL_BASE}{s3_key}")

 def gen_abtest_config(control: str, treatment: str, models: List[str]) -> str:
     d = {}
@@ -137,9 +158,21 @@ def run_userbenchmarks(pytorch_path: str, torchbench_path: str, base_sha: str, h
     print(f"Running torchbench userbenchmark command: {command}")
     subprocess.check_call(command, cwd=torchbench_path, env=env)

+def process_upload_s3(result_dir: str) -> None:
+    # validate result directory
+    result_dir_path = Path(result_dir)
+    assert result_dir_path.exists(), f"Specified result directory {result_dir} doesn't exist."
+    # upload all files to S3 bucket oss-ci-metrics
+    files = [x for x in result_dir_path.iterdir() if x.is_file()]
+    # upload file to S3 bucket
+    s3_client: S3Client = S3Client()
+    filekey_prefix = result_dir_path.name
+    for f in files:
+        s3_client.upload_file(f, filekey_prefix)
+
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description='Run TorchBench tests based on PR')
-    parser.add_argument('--pr-body', required=True, help="The file that contains body of a Pull Request")
+    parser.add_argument('--pr-body', help="The file that contains body of a Pull Request")

     subparsers = parser.add_subparsers(dest='command')
     # parser for setup the torchbench branch name env
@@ -151,6 +184,9 @@ def run_userbenchmarks(pytorch_path: str, torchbench_path: str, base_sha: str, h
     run_parser.add_argument('--pr-head-sha', required=True, type=str, help="The Pull Request head hash")
     run_parser.add_argument('--pytorch-path', required=True, type=str, help="Path to pytorch repository")
     run_parser.add_argument('--torchbench-path', required=True, type=str, help="Path to TorchBench repository")
+    # parser to upload results to S3
+    upload_parser = subparsers.add_parser("upload-s3")
+    upload_parser.add_argument('--result-dir', required=True, type=str, help="Path to benchmark output")
     args = parser.parse_args()

     if args.command == 'set-torchbench-branch':
@@ -181,6 +217,8 @@ def run_userbenchmarks(pytorch_path: str, torchbench_path: str, base_sha: str, h
         if not models and not userbenchmarks:
             print("Can't parse valid models or userbenchmarks from the pr body. Quit.")
             exit(-1)
+    elif args.command == 'upload-s3':
+        process_upload_s3(args.result_dir)
     else:
         print(f"The command {args.command} is not supported.")
         exit(-1)
```
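
For reference, a hypothetical invocation of the new `upload-s3` subcommand added above; the result-directory path is made up for illustration:

```python
# Hypothetical driver for the new subcommand; CI would normally run the script
# directly. The result directory's basename becomes the S3 key prefix under
# torchbench-pr-test/ in the ossci-metrics bucket.
import subprocess

subprocess.check_call([
    "python", ".github/scripts/run_torchbench.py",
    "upload-s3",
    "--result-dir", "/tmp/torchbench-results/pr12345",  # hypothetical path
])
```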
19 changes: 18 additions & 1 deletion .github/scripts/trymerge.py
```diff
@@ -912,6 +912,8 @@ def merge_into(self, repo: GitRepo, *,

         repo.push(self.default_branch(), dry_run)
         if not dry_run:
+            if land_check_commit:
+                self.delete_land_time_check_branch(repo)
             gh_add_labels(self.org, self.project, self.pr_num, ["merged"])

     def merge_changes(self,
@@ -962,6 +964,11 @@ def create_land_time_check_branch(self,
         repo.checkout(orig_branch)
         return commit

+    def delete_land_time_check_branch(self,
+                                      repo: GitRepo) -> None:
+        land_check_branch = f'landchecks/{self.pr_num}'
+        repo._run_git('push', 'origin', '-d', land_check_branch)
+

 class MandatoryChecksMissingError(Exception):
     pass
@@ -1344,7 +1351,7 @@ def merge(pr_num: int, repo: GitRepo,
     # here to stop the merge process right away
     find_matching_merge_rule(pr, repo, skip_mandatory_checks=True)

-    if land_checks:
+    if land_checks and not dry_run:
         land_check_commit = pr.create_land_time_check_branch(
             repo,
             'viable/strict',
@@ -1354,6 +1361,8 @@

     gh_post_pr_comment(org, project, pr.pr_num, explainer.get_merge_message(land_check_commit))
     if (datetime.utcnow() - pr.last_pushed_at()).days > stale_pr_days:
+        if land_checks and not dry_run:
+            pr.delete_land_time_check_branch(repo)
         raise RuntimeError("This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again.")

     start_time = time.time()
@@ -1366,6 +1375,8 @@
         print(f"Attempting merge of https://github.com/{org}/{project}/pull/{pr_num} ({elapsed_time / 60} minutes elapsed)")
         pr = GitHubPR(org, project, pr_num)
         if initial_commit_sha != pr.last_commit()['oid']:
+            if land_checks and not dry_run:
+                pr.delete_land_time_check_branch(repo)
             raise RuntimeError("New commits were pushed while merging. Please rerun the merge command.")
         try:
             find_matching_merge_rule(pr, repo)
@@ -1400,10 +1411,16 @@ def merge(pr_num: int, repo: GitRepo,
             last_exception = str(ex)
             print(f"Merge of https://github.com/{org}/{project}/pull/{pr_num} failed due to: {ex}. Retrying in 5 min")
             time.sleep(5 * 60)
+        except RuntimeError:
+            if land_checks and not dry_run:
+                pr.delete_land_time_check_branch(repo)
+            raise

     # Finally report timeout back
     msg = f"Merged timed out after {timeout_minutes} minutes. Please contact the pytorch_dev_infra team."
     msg += f"The last exception was: {last_exception}"
     if not dry_run:
+        if land_checks:
+            pr.delete_land_time_check_branch(repo)
         gh_add_labels(org, project, pr_num, ["land-failed"])
     raise RuntimeError(msg)
```
2 changes: 2 additions & 0 deletions .github/workflows/_linux-test.yml
```diff
@@ -117,6 +117,7 @@ jobs:
           NUM_TEST_SHARDS: ${{ matrix.num_shards }}
           PR_BODY: ${{ github.event.pull_request.body }}
           SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
+          SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
           SHM_SIZE: ${{ contains(inputs.build-environment, 'cuda') && '2g' || '1g' }}
           DOCKER_IMAGE: ${{ inputs.docker-image }}
           XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}
@@ -171,6 +172,7 @@ jobs:
           -e PR_LABELS \
           -e MAX_JOBS="$(nproc --ignore=2)" \
           -e SCCACHE_BUCKET \
+          -e SCCACHE_S3_KEY_PREFIX \
           -e XLA_CUDA \
           -e XLA_CLANG_CACHE_S3_BUCKET_NAME \
           --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
```
