{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":65600975,"defaultBranch":"main","name":"pytorch","ownerLogin":"pytorch","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2016-08-13T05:26:41.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/21003710?v=4","public":true,"private":false,"isOrgOwned":true},"refInfo":{"name":"","listCacheKey":"v0:1717232970.0","currentOid":""},"activityList":{"items":[{"before":"7ef7c265d4361691dc4cf54152db083de3215fbf","after":"554265d4504108c1236035f8c957d3364f6c1123","ref":"refs/heads/viable/strict","pushedAt":"2024-06-01T09:49:39.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"pytorchmergebot","name":null,"path":"/pytorchmergebot","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/97764156?s=80&v=4"},"commit":{"message":"[Inductor]: Use new device-agnostic libdevice import from triton.language (#127348)\n\nTriton refactored `libdevice` in https://github.com/triton-lang/triton/commit/5e6952d8c529770ff0321c8ded633c32af0ff9ea\n\nWhile both imports still appear to work under CUDA, this change is required to pull the correct libdevice variants under the Intel XPU backend. I am working on developing a test that catches this behavior. The easiest path would be to enable `test/inductor/test_triton_kernels.py` under the XPU backend, but a different group at Intel manages that test and I need to see if they already have an enabling plan.\n\nI am not sure the double `libdevice` import (see line 22 where I have the nolint flag) is really necessary but have yet to find a conclusive test case.\n\nPull Request resolved: https://github.com/pytorch/pytorch/pull/127348\nApproved by: https://github.com/etaf, https://github.com/peterbell10","shortMessageHtmlLink":"[Inductor]: Use new device-agnostic libdevice import from triton.lang…"}},{"before":"f4a8cd192acd8e8fc4c37de9d2386da14e73d62a","after":null,"ref":"refs/tags/ciflow/inductor/126545","pushedAt":"2024-06-01T09:09:28.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"pytorch-bot[bot]","name":null,"path":"/apps/pytorch-bot","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/40112?s=80&v=4"}},{"before":"7582246cadfca4ad93e13230691a7dac5540752c","after":null,"ref":"refs/tags/ciflow/inductor/126068","pushedAt":"2024-06-01T09:09:28.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"pytorch-bot[bot]","name":null,"path":"/apps/pytorch-bot","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/40112?s=80&v=4"}},{"before":"7582246cadfca4ad93e13230691a7dac5540752c","after":null,"ref":"refs/tags/ciflow/trunk/126068","pushedAt":"2024-06-01T09:09:27.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"pytorch-bot[bot]","name":null,"path":"/apps/pytorch-bot","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/40112?s=80&v=4"}},{"before":"7673d5b0440065188de6ff36cbc6f1fdc3758e26","after":"2ae3f7f38c84064f034c7517454c7d49abd64938","ref":"refs/heads/gh/jgong5/46/orig","pushedAt":"2024-06-01T09:09:26.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"jgong5","name":"Jiong Gong","path":"/jgong5","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/8359223?s=80&v=4"},"commit":{"message":"[inductor][cpp] bf16/fp16 gemm template computed with fp32\n\nghstack-source-id: d81c7a7ad06717f9155c94f963eb322c6ec61954\nPull Request resolved: https://github.com/pytorch/pytorch/pull/126068","shortMessageHtmlLink":"[inductor][cpp] bf16/fp16 gemm template 
computed with fp32"}},{"before":"f503ea47a0a11c320a16d631feb7568dc02dcb18","after":"3fdb6c7ac1258d9917102ef41ca63db1c6237ed2","ref":"refs/heads/gh/jgong5/49/orig","pushedAt":"2024-06-01T09:09:26.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"jgong5","name":"Jiong Gong","path":"/jgong5","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/8359223?s=80&v=4"},"commit":{"message":"[inductor][cpp] BF16 AMX micro-gemm support\n\nghstack-source-id: 3186527537549f760c3080349924efb4cfcf5242\nPull Request resolved: https://github.com/pytorch/pytorch/pull/127195","shortMessageHtmlLink":"[inductor][cpp] BF16 AMX micro-gemm support"}},{"before":"140998ca90735cb97049a8314d9f59aff9a7a0ed","after":"c9911e81885f8db7ad0a58026d9c1e0f69df1407","ref":"refs/heads/gh/jgong5/50/orig","pushedAt":"2024-06-01T09:09:26.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"jgong5","name":"Jiong Gong","path":"/jgong5","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/8359223?s=80&v=4"},"commit":{"message":"[cpuinfo] bump cpuinfo to the latest to support amx isa check\n\nghstack-source-id: 75fd2420f6959772620f9af244b04379c3fef6a1\nPull Request resolved: https://github.com/pytorch/pytorch/pull/127505","shortMessageHtmlLink":"[cpuinfo] bump cpuinfo to the latest to support amx isa check"}},{"before":"f4a8cd192acd8e8fc4c37de9d2386da14e73d62a","after":null,"ref":"refs/tags/ciflow/trunk/126545","pushedAt":"2024-06-01T09:09:26.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"pytorch-bot[bot]","name":null,"path":"/apps/pytorch-bot","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/40112?s=80&v=4"}},{"before":"4d8d997161cfb9e7fc773b3048c3cd2c75eb4d1d","after":"d6c3424772375b2986b2895ea978c5cfe940a254","ref":"refs/heads/gh/jgong5/48/orig","pushedAt":"2024-06-01T09:09:26.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"jgong5","name":"Jiong Gong","path":"/jgong5","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/8359223?s=80&v=4"},"commit":{"message":"[inductor][cpp] support bf16/fp16 gemm template epilogue fusion\n\nghstack-source-id: 70fc55829a8d52fcc6e826f670efb5f52a3a8a01\nPull Request resolved: https://github.com/pytorch/pytorch/pull/126545","shortMessageHtmlLink":"[inductor][cpp] support bf16/fp16 gemm template epilogue fusion"}},{"before":"836f45dde39880a6b6b3f426ae17aa9c124f4335","after":null,"ref":"refs/tags/ciflow/inductor/127195","pushedAt":"2024-06-01T09:09:26.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"pytorch-bot[bot]","name":null,"path":"/apps/pytorch-bot","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/40112?s=80&v=4"}},{"before":"7582246cadfca4ad93e13230691a7dac5540752c","after":"6add699a9f1014e815790a7464d824492a589c94","ref":"refs/heads/gh/jgong5/46/head","pushedAt":"2024-06-01T09:09:20.000Z","pushType":"push","commitsCount":3,"pusher":{"login":"jgong5","name":"Jiong Gong","path":"/jgong5","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/8359223?s=80&v=4"},"commit":{"message":"Update on \"[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion\"\n\n\r\nAs part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. 
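"Computed with fp32" above describes the numerical recipe the template follows: bf16/fp16 inputs are upcast and accumulated in float32, and the result is cast back to the low-precision dtype, with the type casts fused into the micro-gemm rather than materialized as fp32 copies of the inputs. A minimal PyTorch sketch of the reference semantics (illustrative only, not the template's C++ micro-gemm code):

```python
import torch

# bf16 inputs, fp32 accumulation, bf16 output -- the reference semantics of
# the "computed with fp32" gemm template.
a = torch.randn(64, 128, dtype=torch.bfloat16)
b = torch.randn(128, 32, dtype=torch.bfloat16)

ref = (a.float() @ b.float()).to(torch.bfloat16)
print(ref.dtype, ref.shape)  # torch.bfloat16 torch.Size([64, 32])
```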
2024-06-01 09:09 UTC - jgong5 pushed 3 commits to gh/jgong5/50/head - Update on "[cpuinfo] bump cpuinfo to the latest to support amx isa check":
  Fixes https://github.com/pytorch/pytorch/issues/127368.

2024-06-01 09:09 UTC - jgong5 pushed 3 commits to gh/jgong5/48/head - Update on "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion":
  As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are:
  1. bf16 linear with epilogue fusion of some ops was originally supported via the ATen oneDNN linear pointwise ops. To match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we have "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues are concatenated with the out-of-template epilogues appended during scheduling.
  2. Support bf16/fp16 legalization for `codegen_loop_bodies`, which is used to generate the epilogue loops.
  3. The in-place buffer mechanism was previously used to handle in-place buffers in the epilogue codegen, in particular the reuse of the output buffers of the GEMM, template, and epilogues. This was not correct, since the output buffer is an "output", not an "in-place" buffer, of the template kernel itself. Now a dedicated "aliases" dict manages such buffer reuses, and the intermediate aliasing buffers are removed after codegen.
  4. Add a `localize_buffer` method to `LocalBufferScope` to allow replacing a global buffer with a local one in the given inductor IR nodes. This helps the fused loops work on smaller local buffers for better data locality.

2024-06-01 09:09 UTC - jgong5 pushed 3 commits to gh/jgong5/49/head - Update on "[inductor][cpp] BF16 AMX micro-gemm support":
  This PR adds an intrinsics-based micro-gemm for BF16 using the Advanced Matrix Extensions (AMX) instructions available on 4th- and 5th-generation Intel Xeon processors. A compilation check is added to `codecache.py` to verify compiler support. Also, since AMX requires the Linux kernel to enable its extra register state, an initialization function is added and triggered via `codecache.py`.
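The "initialization function" mentioned in the AMX entry refers to the fact that, on Linux, a process must request permission for the AMX tile-data register state before AMX instructions can be used. Below is a hedged sketch of that mechanism: the constants are the documented x86-64 Linux values, but the ctypes wrapper is only an illustration, not the code the PR triggers via `codecache.py`.

```python
import ctypes

# x86-64 Linux values for requesting the AMX tile-data state (assumption:
# these documented constants match the running kernel's uapi headers).
SYS_ARCH_PRCTL = 158          # arch_prctl syscall number on x86-64
ARCH_REQ_XCOMP_PERM = 0x1023  # request permission for an XSAVE component
XFEATURE_XTILEDATA = 18       # the AMX tile-data state component

_libc = ctypes.CDLL(None, use_errno=True)

def request_amx_permission() -> bool:
    """Ask the kernel to enable AMX tile data for this process.

    Returns True on success; returns False on non-x86-64 platforms or on
    kernels without AMX support.
    """
    ret = _libc.syscall(SYS_ARCH_PRCTL, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)
    return ret == 0

if __name__ == "__main__":
    print("AMX tile data permitted:", request_amx_permission())
```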
2024-06-01 09:09 UTC - jgong5 pushed 2 commits each to the matching ghstack base branches (gh/jgong5/48/base, gh/jgong5/50/base, gh/jgong5/46/base, gh/jgong5/49/base) with "Update base for Update on ..." commits; their descriptions repeat the head-branch updates above.

2024-06-01 08:24 UTC - pytorchmergebot pushed 20 commits to viable/strict; head commit:
  "Ack codecvt_utf8_utf16 as a deprecated func in C++17" (#127659)
  codecvt_utf8_utf16 is deprecated in C++17 (https://en.cppreference.com/w/cpp/header/codecvt). The build started to fail on macOS after migrating to macOS 14 with a newer toolchain, for example https://hud.pytorch.org/pytorch/pytorch/commit/57baae9c9b43fd31199dedd3f0fd5ed67faf5769. As there is no clear alternative to the deprecated function yet, the warning is acknowledged to fix the build and complete the migration tracked in https://github.com/pytorch/pytorch/issues/127490.
  Pull Request resolved: https://github.com/pytorch/pytorch/pull/127659
  Approved by: https://github.com/kit1980, https://github.com/atalman

2024-06-01 07:40 UTC - pytorch-bot[bot] deleted tags ciflow/binaries/123475 and ciflow/trunk/123475.

2024-06-01 07:40 UTC - nWEIdia (Wei Wang) pushed 1 commit to updatecudnn9: "Fix errors in #127589 (libtorch build)".

2024-06-01 07:33 UTC - pytorchbot pushed 1 commit to nightly: "2024-06-01 nightly release (121c55d8d12a878b12eab00a7cebae2e2fa47ee7)".
2024-06-01 07:29 UTC - pytorch-bot[bot] deleted tags ciflow/trunk/127589, ciflow/inductor/127589, and ciflow/binaries/127589.

2024-06-01 07:17 UTC - pytorch-bot[bot] deleted tags ciflow/trunk/126598, ciflow/inductor/127454, and ciflow/inductor/126598.

2024-06-01 07:17 UTC - yifuwang (Yifu Wang) force-pushed gh/yifuwang/88/orig:
  "Improve the scheduling for fused_matmul_reduce_scatter"
  In fused_all_gather_matmul, each rank copies its shard into its local p2p buffer, performs a barrier, then performs (copy -> matmul) for each remote shard. The (copy -> matmul)s for remote shards run on two streams without synchronization. This not only allows computation/communication overlapping but also computation/computation overlapping, which alleviates the wave-quantization effect caused by computation decomposition.
  However, the synchronization-free approach does not work well with fused_matmul_reduce_scatter, in which there is a barrier in every step. Without synchronization between the two streams, a matmul in one stream can delay a barrier in the other stream, further delaying the copy waiting for that barrier.
  This PR addresses the issue by adding synchronization between the two streams such that the matmul of step i can only start after the barrier of step i-1 completes. With this approach we lose the computation/computation overlapping but avoid the slowdown caused by a delayed barrier.
  Pull Request resolved: https://github.com/pytorch/pytorch/pull/127455
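The fix described in the last entry is a standard cross-stream ordering pattern: record an event on the stream that executes the barrier of step i-1, and make the compute stream wait on that event before launching the matmul of step i. A minimal sketch with `torch.cuda` streams and events (illustrative only; `a_shards`, `b`, and `comm_work` are hypothetical stand-ins, not the actual fused_matmul_reduce_scatter internals):

```python
import torch

def ordered_steps(a_shards, b, comm_work, num_steps):
    """Run per-step comm and matmul on two streams, but make the matmul of
    step i wait for the barrier/comm of step i-1 to complete."""
    comm_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.Stream()
    barrier_done = [torch.cuda.Event() for _ in range(num_steps)]
    outs = []
    for i in range(num_steps):
        with torch.cuda.stream(comm_stream):
            comm_work(i)                       # stand-in for copy + barrier
            barrier_done[i].record(comm_stream)
        with torch.cuda.stream(compute_stream):
            if i > 0:
                # Cross-stream dependency: matmul i starts only after the
                # barrier of step i-1 has completed.
                compute_stream.wait_event(barrier_done[i - 1])
            outs.append(a_shards[i] @ b)
    torch.cuda.synchronize()
    return outs
```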