{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":130725814,"defaultBranch":"master","name":"apex","ownerLogin":"NVIDIA","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2018-04-23T16:28:52.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/1728152?v=4","public":true,"private":false,"isOrgOwned":true},"refInfo":{"name":"","listCacheKey":"v0:1714280974.0","currentOid":""},"activityList":{"items":[{"before":"a7de60e57f0534266841e1733262601ad76aaa74","after":"4138d31ff0acf4071d1dc001ccb7cd6e00800324","ref":"refs/heads/24.04.01-devel","pushedAt":"2024-04-28T04:50:32.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"Aidyn-A","name":null,"path":"/Aidyn-A","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/31858918?s=80&v=4"},"commit":{"message":"Enhance Distributed Fused Adam (#1794)","shortMessageHtmlLink":"Enhance Distributed Fused Adam (#1794)"}},{"before":null,"after":"a7de60e57f0534266841e1733262601ad76aaa74","ref":"refs/heads/24.04.01-devel","pushedAt":"2024-04-28T04:49:34.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"Aidyn-A","name":null,"path":"/Aidyn-A","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/31858918?s=80&v=4"},"commit":{"message":"Fix reduce_blocks_into_lanes race condition (#1798)\n\n* move __sync_threads() outside if branch\r\n\r\n* add clarifying comment","shortMessageHtmlLink":"Fix reduce_blocks_into_lanes race condition (#1798)"}},{"before":"f3f049246e5bdf6fdddf251ebe6b65dd4ca1ee29","after":"a7de60e57f0534266841e1733262601ad76aaa74","ref":"refs/heads/master","pushedAt":"2024-04-26T06:29:23.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"Fix reduce_blocks_into_lanes race condition (#1798)\n\n* move __sync_threads() outside if branch\r\n\r\n* add clarifying comment","shortMessageHtmlLink":"Fix reduce_blocks_into_lanes race condition (#1798)"}},{"before":"6038fc1a364256c52d58fddb4bb0695cf4bbf60e","after":"f3f049246e5bdf6fdddf251ebe6b65dd4ca1ee29","ref":"refs/heads/master","pushedAt":"2024-04-24T19:35:07.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"Aidyn-A","name":null,"path":"/Aidyn-A","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/31858918?s=80&v=4"},"commit":{"message":"NCCL userbuffer for DP RS in DistOpt (#1797)\n\n* NCCL userbuffer for AG/RS in DistOpt\r\n\r\nSigned-off-by: qiyuw \r\n\r\n* remove empty line\r\n\r\nSigned-off-by: qiyuw \r\n\r\n* Add test case\r\n\r\nSigned-off-by: qiyuw \r\n\r\n* fix an issue\r\n\r\nSigned-off-by: Qiyu Wan \r\n\r\n---------\r\n\r\nSigned-off-by: qiyuw \r\nSigned-off-by: Qiyu Wan \r\nCo-authored-by: qiyuw \r\nCo-authored-by: Qiyu Wan ","shortMessageHtmlLink":"NCCL userbuffer for DP RS in DistOpt (#1797)"}},{"before":"b5df1ccf89d8013556b1d1d823fc34268cae8e9c","after":"6038fc1a364256c52d58fddb4bb0695cf4bbf60e","ref":"refs/heads/master","pushedAt":"2024-04-24T19:34:30.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"Aidyn-A","name":null,"path":"/Aidyn-A","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/31858918?s=80&v=4"},"commit":{"message":"Add nccl_allocator for zero-copy user buffer (#1796)\n\n* add nccl_allocator for zero-copy user buffer\r\n\r\n* review comments","shortMessageHtmlLink":"Add nccl_allocator for zero-copy user buffer (#1796)"}},{"before":"c5f6b7958922d5fb730ea7172309a0dbd43033c1","after":"b5df1ccf89d8013556b1d1d823fc34268cae8e9c","ref":"refs/heads/master","pushedAt":"2024-04-19T05:13:09.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"Add 2D Fused RoPE (#1784)\n\n* add 2D fused RoPE\r\n\r\nSigned-off-by: Xin Yao \r\n\r\n* Update fused_rotary_positional_embedding.h\r\n\r\n---------\r\n\r\nSigned-off-by: Xin Yao ","shortMessageHtmlLink":"Add 2D Fused RoPE (#1784)"}},{"before":"810ffae374a2b9cb4b5c5e28eaeca7d7998fca0c","after":"c5f6b7958922d5fb730ea7172309a0dbd43033c1","ref":"refs/heads/master","pushedAt":"2024-04-19T04:17:24.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"move to correct device for v1 state (#1783)","shortMessageHtmlLink":"move to correct device for v1 state (#1783)"}},{"before":"b496d85fb88a801d8e680872a12822de310951fd","after":"810ffae374a2b9cb4b5c5e28eaeca7d7998fca0c","ref":"refs/heads/master","pushedAt":"2024-03-12T04:38:33.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"Update test_fused_softmax.py (#1782)","shortMessageHtmlLink":"Update test_fused_softmax.py (#1782)"}},{"before":"5b67cd5f6b5174ef21a7190fc24583ce52e7187e","after":"b496d85fb88a801d8e680872a12822de310951fd","ref":"refs/heads/master","pushedAt":"2024-02-08T01:28:57.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"Support scaled optimizer state in distributed Adam optimizer (#1771)\n\n* Add distopt support for scaled states\r\n\r\nSigned-off-by: Tim Moon \r\n\r\n* Debug distopt checkpointing with scaled optimizer state\r\n\r\nSigned-off-by: Tim Moon \r\n\r\n* Debug inconsistent variable name\r\n\r\nSigned-off-by: Tim Moon \r\n\r\n* Debug checkpointing\r\n\r\nSigned-off-by: Tim Moon \r\n\r\n* Complain if scaling fp32 states\r\n\r\nSigned-off-by: Tim Moon \r\n\r\n* Make sure state scaling is done in fp32\r\n\r\nSigned-off-by: Tim Moon \r\n\r\n* Change from per-parameter scaling factors to per-fragment\r\n\r\nCall _check_params_shard_dtypes within _local_step. Fuse scaling factor computation.\r\n\r\nSigned-off-by: Tim Moon \r\n\r\n* Support overlapping first bucket AG with scaled state\r\n\r\nSigned-off-by: Tim Moon \r\n\r\n* Correctly load in per-param-group settings from checkpoint\r\n\r\nSigned-off-by: Tim Moon \r\n\r\n* Handle with contiguous param buffer and int param sync dtype\r\n\r\nSigned-off-by: Tim Moon \r\n\r\n* Tweak docs\r\n\r\nSigned-off-by: Tim Moon \r\n\r\n* Fix excessive memory usage with scaled optim state\r\n\r\nSigned-off-by: Tim Moon \r\n\r\n* Silence warning about autograd through broadcast\r\n\r\nSigned-off-by: Tim Moon \r\n\r\n* Debug tests with multiple models\r\n\r\nShows up in PyTorch builds starting 20240118.\r\n\r\nSigned-off-by: Tim Moon \r\n\r\n---------\r\n\r\nSigned-off-by: Tim Moon ","shortMessageHtmlLink":"Support scaled optimizer state in distributed Adam optimizer (#1771)"}},{"before":"7e239f7534562c88dd03e2d3919ed1ec8a872a1f","after":"5b67cd5f6b5174ef21a7190fc24583ce52e7187e","ref":"refs/heads/master","pushedAt":"2024-02-07T20:52:50.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"Aidyn-A","name":null,"path":"/Aidyn-A","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/31858918?s=80&v=4"},"commit":{"message":"Add GPUDirect Storage (#1774)\n\n* add gpu_direct_storage\r\n\r\n* apply suggested changes\r\n\r\n* use OOP API","shortMessageHtmlLink":"Add GPUDirect Storage (#1774)"}},{"before":"141bbf1cf362d4ca4d94f4284393e91dda5105a5","after":"7e239f7534562c88dd03e2d3919ed1ec8a872a1f","ref":"refs/heads/master","pushedAt":"2024-02-07T07:25:31.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"Skip the p2p test on single GPU platforms (#1775)","shortMessageHtmlLink":"Skip the p2p test on single GPU platforms (#1775)"}},{"before":"6c8f384b40a596bbed960f5e8d9a808ebd0e93d8","after":"141bbf1cf362d4ca4d94f4284393e91dda5105a5","ref":"refs/heads/master","pushedAt":"2024-01-25T04:40:36.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"Update test_adam.py (#1772)","shortMessageHtmlLink":"Update test_adam.py (#1772)"}},{"before":"48c4894c4b38b2b77cd7a0473ca665e89c9c148b","after":"6c8f384b40a596bbed960f5e8d9a808ebd0e93d8","ref":"refs/heads/master","pushedAt":"2024-01-18T16:11:01.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"Update test_bottleneck_module.py - Skip BottleNeck Peer Memory Test if Not Supported (#1769)\n\nIf hw configuration disabled peer memory access, skip the bottleneck tests.","shortMessageHtmlLink":"Update test_bottleneck_module.py - Skip BottleNeck Peer Memory Test i…"}},{"before":"f058162b215791b15507bb542f22ccfde49c872d","after":"48c4894c4b38b2b77cd7a0473ca665e89c9c148b","ref":"refs/heads/master","pushedAt":"2024-01-12T17:25:39.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"Update test_transducer_joint.py (#1767)\n\nIncrease tolerance to workaround unit test failures \r\n\r\n torch.testing.assert_close(f_grad_ref, f_grad_tst, atol=1e-5, rtol=1e-5)\r\nMismatched elements: 1 / 205636 (0.0%)\r\nGreatest absolute difference: 3.0517578125e-05 at index (3, 27, 390) (up to 1e-05 allowed)\r\nGreatest relative difference: 0.000492095947265625 at index (3, 27, 390) (up to 1e-05 allowed)\r\n\r\n torch.testing.assert_close(g_grad_ref, g_grad_tst, atol=1e-4, rtol=1e-4)\r\nMismatched elements: 1 / 51200 (0.0%)\r\nGreatest absolute difference: 0.0009765625 at index (0, 15, 280) (up to 0.0001 allowed)\r\nGreatest relative difference: 0.0008397102355957031 at index (0, 15, 280) (up to 0.0001 allowed)","shortMessageHtmlLink":"Update test_transducer_joint.py (#1767)"}},{"before":"e9789cc46c3189c9652df3e5752aa3c56909767e","after":"f058162b215791b15507bb542f22ccfde49c872d","ref":"refs/heads/master","pushedAt":"2024-01-12T04:40:32.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"Fused RoPE for `thd` format (#1756)\n\n* fused rope for thd format\r\n\r\nSigned-off-by: Xin Yao \r\n\r\n* update the test\r\n\r\nSigned-off-by: Xin Yao \r\n\r\n* update test\r\n\r\nSigned-off-by: Xin Yao \r\n\r\n* remove redudant arguments\r\n\r\nSigned-off-by: Xin Yao \r\n\r\n* add comments & simplify code\r\n\r\nSigned-off-by: Xin Yao \r\n\r\n---------\r\n\r\nSigned-off-by: Xin Yao ","shortMessageHtmlLink":"Fused RoPE for thd format (#1756)"}},{"before":"87c4debde8000636ab60b0fc477f324af789c1f7","after":"e9789cc46c3189c9652df3e5752aa3c56909767e","ref":"refs/heads/master","pushedAt":"2024-01-10T15:03:09.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"Increase tolerance to workaround tolerance issues on A100 (#1766)\n\nfailures happen with absolute difference of ~0.001190185546875 and relative diff of ~0.0306854248046875.","shortMessageHtmlLink":"Increase tolerance to workaround tolerance issues on A100 (#1766)"}},{"before":"c07a4cf67102b9cd3f97d1ba36690f985bae4227","after":"87c4debde8000636ab60b0fc477f324af789c1f7","ref":"refs/heads/master","pushedAt":"2024-01-05T08:03:05.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"64-bit indexing Adam (#1765)\n\n* all i want for christmas is larger binaries and longer compile times\r\n\r\n* actually compare\r\n\r\n* woops","shortMessageHtmlLink":"64-bit indexing Adam (#1765)"}},{"before":"ccffcc43489f2d3556eab2cff1953e4962fba5b4","after":"c07a4cf67102b9cd3f97d1ba36690f985bae4227","ref":"refs/heads/master","pushedAt":"2024-01-01T05:17:33.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"Make fused layer norm functions backward-compatible (#1760)\n\nSigned-off-by: Tim Moon ","shortMessageHtmlLink":"Make fused layer norm functions backward-compatible (#1760)"}},{"before":"5d89c04b07e2d0bd99f915705dc5af2e0c358eec","after":"ccffcc43489f2d3556eab2cff1953e4962fba5b4","ref":"refs/heads/master","pushedAt":"2023-12-15T04:18:45.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"[contrib] Improve FusedAdamSWA interface and add unit tests (#1759)\n\nWhy?\r\n- FusedAdamSWA interface was loosely typed and error-prone\r\n- The training critical path of FusedAdamSWA (i.e., its step function)\r\n could contain unnecessary GPU-host sync when grad_clip_scale is set\r\n to a non-CUDA-tensor variable\r\n- FusedAdamSWA didn't have any unit test\r\n\r\nWhat?\r\n- Encapsulated FusedAdamSWA math types and internal numerical type into\r\n Python enumerations to improve type robustness and readability\r\n- Accept grad_clip_scale as either a tensor or a number, for the latter\r\n case we move it to GPU in a non-blocking manner to eliminate a\r\n GPU-host sync\r\n- Add unit test to guarentee numerical correctness and demostrate usage\r\n\r\nCo-authored-by: Masaki Kozuki ","shortMessageHtmlLink":"[contrib] Improve FusedAdamSWA interface and add unit tests (#1759)"}},{"before":"37d83fce4dcbb59897dfd951906493a6fe7fae37","after":"5d89c04b07e2d0bd99f915705dc5af2e0c358eec","ref":"refs/heads/master","pushedAt":"2023-12-15T03:44:21.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"add async copy for openfold swa (#1758)\n\nCo-authored-by: Feiwen Zhu ","shortMessageHtmlLink":"add async copy for openfold swa (#1758)"}},{"before":"7548f68179b3058b79b66b3a3ecc4bb156eefc10","after":"37d83fce4dcbb59897dfd951906493a6fe7fae37","ref":"refs/heads/master","pushedAt":"2023-11-29T01:03:00.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"[FusedRoPE] Fuse type conversion and cos/sin (#1752)\n\n* minor fix\r\n\r\n* fuse type conversion\r\n\r\nSigned-off-by: Xin Yao \r\n\r\n* fuse cos/sin\r\n\r\nSigned-off-by: Xin Yao \r\n\r\n* update comments\r\n\r\nSigned-off-by: Xin Yao \r\n\r\n* fix typo\r\n\r\nSigned-off-by: Xin Yao \r\n\r\n* lint\r\n\r\nSigned-off-by: Xin Yao \r\n\r\n* use TORCH_CHECK instead of AT_ERROR\r\n\r\nSigned-off-by: Xin Yao \r\n\r\n---------\r\n\r\nSigned-off-by: Xin Yao ","shortMessageHtmlLink":"[FusedRoPE] Fuse type conversion and cos/sin (#1752)"}},{"before":"dc5fa388cf297aa679ffa7cd98478e18defb6248","after":"7548f68179b3058b79b66b3a3ecc4bb156eefc10","ref":"refs/heads/master","pushedAt":"2023-11-28T02:32:14.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"Use recommended PyTorch methods to silence warnings (#1754)\n\nGetting warnings of the following form:\r\n\r\n```\r\n/usr/local/lib/python3.10/dist-packages/apex/optimizers/fused_adam.py:112: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:83.)\r\n self._dummy_overflow_buf = torch.cuda.IntTensor([0])\r\n```","shortMessageHtmlLink":"Use recommended PyTorch methods to silence warnings (#1754)"}},{"before":"a2f6683b10fb4c29ab57c9e3d16957db76a8a5ba","after":"dc5fa388cf297aa679ffa7cd98478e18defb6248","ref":"refs/heads/master","pushedAt":"2023-11-23T01:36:08.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"Avoid `.contiguous()` in fused RoPE (#1751)\n\n* avoid input.contiguous() in fused_rope\r\n\r\nSigned-off-by: Xin Yao \r\n\r\n* add transpose_output_memory\r\n\r\nSigned-off-by: Xin Yao \r\n\r\n---------\r\n\r\nSigned-off-by: Xin Yao ","shortMessageHtmlLink":"Avoid .contiguous() in fused RoPE (#1751)"}},{"before":"fd4ae7d18f4b5150050bdbebb31b2d160413671d","after":"a2f6683b10fb4c29ab57c9e3d16957db76a8a5ba","ref":"refs/heads/master","pushedAt":"2023-11-20T05:30:06.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"Distributed optimizer support for contiguous param buffer with FP8 params (#1749)\n\n* Debug distopt contiguous param buffers with uint8 param all-gathers\r\n\r\nSigned-off-by: Tim Moon \r\n\r\n* Add test\r\n\r\nSigned-off-by: Tim Moon \r\n\r\n* Avoid temporary buffer for param shard in optim step if possible\r\n\r\nSigned-off-by: Tim Moon \r\n\r\n---------\r\n\r\nSigned-off-by: Tim Moon ","shortMessageHtmlLink":"Distributed optimizer support for contiguous param buffer with FP8 pa…"}},{"before":"08f740290f999296d319ed2e3f21cd00b810918a","after":"fd4ae7d18f4b5150050bdbebb31b2d160413671d","ref":"refs/heads/master","pushedAt":"2023-11-16T09:58:00.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"fix a bug in fused rope (#1750)\n\nSigned-off-by: Xin Yao ","shortMessageHtmlLink":"fix a bug in fused rope (#1750)"}},{"before":"97e38d6255f4f5c95cc5fe368cccdd68e97e5865","after":"08f740290f999296d319ed2e3f21cd00b810918a","ref":"refs/heads/master","pushedAt":"2023-11-14T05:38:09.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"A fused `apply_rotary_pos_emb` implementation for Megatron-Core (#1746)\n\n* fused rope\r\n\r\nSigned-off-by: Xin Yao \r\n\r\n* add checks and a unit test\r\n\r\nSigned-off-by: Xin Yao \r\n\r\n* use better block size\r\n\r\nSigned-off-by: Xin Yao \r\n\r\n* add fused_rope to functional\r\n\r\nSigned-off-by: Xin Yao \r\n\r\n---------\r\n\r\nSigned-off-by: Xin Yao ","shortMessageHtmlLink":"A fused apply_rotary_pos_emb implementation for Megatron-Core (#1746)"}},{"before":"acd89502dd5d6d733c29e6c6df3945ddaf8e509e","after":"97e38d6255f4f5c95cc5fe368cccdd68e97e5865","ref":"refs/heads/master","pushedAt":"2023-11-10T02:04:13.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"Misc Changes (#1747)\n\n* Update run.sh\r\n\r\n* Update p2p_communication.py","shortMessageHtmlLink":"Misc Changes (#1747)"}},{"before":"9fc94b7d6db1b178adf9a6e92750f070dd9f825d","after":"acd89502dd5d6d733c29e6c6df3945ddaf8e509e","ref":"refs/heads/master","pushedAt":"2023-11-08T07:59:25.000Z","pushType":"pr_merge","commitsCount":5,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"Add precision comments.","shortMessageHtmlLink":"Add precision comments."}},{"before":"ddd9b3f8bbdd21d300914e4610a2fcd5acc6b292","after":"9fc94b7d6db1b178adf9a6e92750f070dd9f825d","ref":"refs/heads/master","pushedAt":"2023-10-19T23:55:26.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"Specify `rtol` as well as `atol` in test_fmha.py (#1743)","shortMessageHtmlLink":"Specify rtol as well as atol in test_fmha.py (#1743)"}},{"before":"19cc873541f9208c17e97538b2e84295892dd992","after":"ddd9b3f8bbdd21d300914e4610a2fcd5acc6b292","ref":"refs/heads/master","pushedAt":"2023-10-19T01:05:56.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"crcrpar","name":"Masaki Kozuki","path":"/crcrpar","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16191443?s=80&v=4"},"commit":{"message":"Loop through all available engines for cuDNN heuristics search (#1740)\n\n* Increase magic number\r\n\r\n* Use heuristics engine count\r\n\r\n* Move cuDNN debug message\r\n\r\n---------\r\n\r\nCo-authored-by: Jaemin Choi ","shortMessageHtmlLink":"Loop through all available engines for cuDNN heuristics search (#1740)"}}],"hasNextPage":true,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"djE6ks8AAAAEPByFXQA","startCursor":null,"endCursor":null}},"title":"Activity · NVIDIA/apex"}