08 Apr 21:31

lsy323

6f93cc1

PyTorch/XLA 2.3 Release Notes

Highlights

We are excited to announce the release of PyTorch/XLA 2.3! PyTorch/XLA 2.3 offers experimental support for SPMD auto-sharding on a single TPU host, which allows users to shard their models on TPU with a single config change. We have also added experimental support for Pallas custom kernels for inference, which lets users run popular custom kernels such as flash attention and paged attention on TPU.

Stable Features

PJRT

Experimental GPU PJRT Plugin (#6240)
Define PJRT plugin interface in C++ (#6360)
Add limit to max inflight TPU computations (#6533)
Remove TPU_C_API device type (#6435)

GSPMD

Introduce global mesh (#6498)
Introduce xla_distribute_module for DTensor integration (#6683)

Torch Compile

Support activation sharding within torch.compile (#6524)
Do not cache FX input args in dynamo bridge to avoid memory leak (#6553)
Ignore non-XLA nodes and their direct dependents (#6170)

Export

Support of implicit broadcasting with unbounded dynamism (#6219)
Support multiple StableHLO Composite outputs (#6295)
Add support of dynamism for add (#6443)
Enable unbounded dynamism on conv, softmax, addmm, slice (#6494)
Handle constant variable (#6510)

Beta Features

CoreAtenOpSet

Support all Core Aten Ops used by `torch.export`

Lower reflection_pad1d, reflection_pad1d_backward, reflection_pad3d and reflection_pad3d_backward (#6588)
Lower replication_pad3d and replication_pad3d_backward (#6566)
Lower the embedding op (#6495)
Lowering for _pdist_forward (#6507)
Support mixed precision for torch.where (#6303)

Benchmark

Unify PyTorch/XLA and PyTorch torchbench model configuration using the same torchbench.yaml (#6881)
Align model data precision settings with PyTorch HUD (#6447, #6518, #6555)
Fix some torchbench model configurations so that they are runnable using XLA (#6509, #6542, #6558, #6612)

FSDP via SPMD

Make FSDPv2 use the global mesh API (#6500)
Enable auto-wrapping (#6499)

Distributed Checkpoint

Add process group documentation for SPMD (#6469)

Usability

Support torch_xla.device (#6571)

GPU

Fix global_device_count(), local_device_count() for single process on CUDA (#6022)
Automatically use XLA:GPU if on a GPU machine (#6605)
Add SPMD on GPU instructions (#6684)
Build XLA:GPU as a separate Plugin (#6825)

Distributed

Support tensor bucketing for all-gather and reduce-scatter for ZeRO1 (#6025)
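
As a rough illustration of what tensor bucketing means here, the sketch below groups tensors into size-capped buckets so that each bucket could be sent with a single collective call. It is plain Python and not the PyTorch/XLA implementation; the function name `bucket_by_size` and the 8 MB cap are assumptions made for the example.

```python
from typing import List, Sequence


def bucket_by_size(sizes_in_bytes: Sequence[int], bucket_cap_bytes: int) -> List[List[int]]:
    """Group tensor indices into buckets whose total size does not exceed the cap.

    Hypothetical helper for illustration. A tensor larger than the cap
    ends up in a bucket of its own.
    """
    buckets: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for index, size in enumerate(sizes_in_bytes):
        # Close the current bucket if adding this tensor would overflow it.
        if current and current_bytes + size > bucket_cap_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        buckets.append(current)
    return buckets


MB = 1024 * 1024
# Four tensors of 4, 4, 10 and 2 MB with an 8 MB cap.
example_buckets = bucket_by_size([4 * MB, 4 * MB, 10 * MB, 2 * MB], 8 * MB)
print(example_buckets)  # [[0, 1], [2], [3]]
```

Each inner list would correspond to one coalesced all-gather or reduce-scatter instead of one call per tensor.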

Experimental Features

Pallas

Introduce Flash Attention kernel using Pallas (#6827)
Support Flash Attention kernel with causal mask (#6837)
Support Flash Attention kernel with torch.compile (#6875)
Support Pallas kernel (#6340)
Support programmatically extracting the payload from Pallas kernel (#6696)
Support Pallas kernel with torch.compile (#6477)
Introduce helper to convert Pallas kernel to PyTorch/XLA callable (#6713)
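
For readers unfamiliar with the term, a causal mask restricts each query position to keys at the same or earlier positions. The following is a plain-Python reference of that masking rule only, assuming a square score matrix; it is not the Pallas kernel, and the function name is made up for illustration.

```python
import math
from typing import List


def causal_attention_weights(scores: List[List[float]]) -> List[List[float]]:
    """Row-wise softmax in which query i only attends to keys j <= i.

    Hypothetical reference helper; assumes `scores` is a square matrix.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]           # keys at positions 0..i
        peak = max(visible)              # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        denom = sum(exps)
        masked_tail = [0.0] * (len(row) - i - 1)  # future positions get zero weight
        weights.append([e / denom for e in exps] + masked_tail)
    return weights


uniform = causal_attention_weights([[0.0, 0.0, 0.0] for _ in range(3)])
print(uniform[1])  # [0.5, 0.5, 0.0]
```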

GSPMD Auto-Sharding

Support auto-sharding for single host TPU (#6719)
Auto construct auto-sharding mesh ids (#6770)

Input Output Aliasing

Support torch.compile for dynamo_set_buffer_donor
Use XLA’s new API to alias graph input and output (#6855)

While Loop

Support torch._higher_order_ops.while_loop with simple examples (#6532, #6603)
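
The semantics of such a loop can be modeled in a few lines. The sketch below assumes the higher-order op takes a condition function, a body function and a tuple of carried values, and it uses plain Python integers rather than tensors; `while_loop_reference` is a hypothetical name, not the PyTorch API.

```python
from typing import Callable, Tuple


def while_loop_reference(cond_fn: Callable[..., bool],
                         body_fn: Callable[..., Tuple],
                         carried_inputs: Tuple) -> Tuple:
    """Run body_fn on the carried values for as long as cond_fn holds."""
    carried = tuple(carried_inputs)
    while cond_fn(*carried):
        # The body must return new carried values with the same structure.
        carried = tuple(body_fn(*carried))
    return carried


def cond_fn(i, total):
    return i < 5


def body_fn(i, total):
    return i + 1, total + i


result = while_loop_reference(cond_fn, body_fn, (0, 0))
print(result)  # (5, 10)
```

Because the body returns values with the same structure as its inputs, the whole loop can in principle be lowered to a single structured loop instead of being unrolled during tracing.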

Bug Fixes and Improvements

Propagate requires_grad over to AllReduce output (#6326)
Avoid fallback for avg_pool (#6409)
Fix output tensor shape for argmin and argmax where keepdim=True and dim=None (#6536)
Fix preserve_rng_state for activation checkpointing (#4690)
Allow int data-type for Embedding indices (#6718)
Don't terminate the whole process when Compile fails (#6707)
Fix an incorrect assert on frame count for PT_XLA_DEBUG=1 (#6466)
Refactor nms into TorchVision variant (#6814)

31 Jan 21:24

zpcore

v2.2.0

053a6f2

PyTorch/XLA 2.2 Release Notes

Cloud TPUs now support the PyTorch 2.2 release, via PyTorch/XLA integration. On top of the underlying improvements and bug fixes in the PyTorch 2.2 release, this release introduces several features and PyTorch/XLA-specific bug fixes.

Installing PyTorch and PyTorch/XLA 2.2.0 wheel:

pip install torch~=2.2.0 torch_xla[tpu]~=2.2.0 -f https://storage.googleapis.com/libtpu-releases/index.html

Please note that you might have to re-install the libtpu on your TPUVM depending on your previous installation:

pip install torch_xla[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html

Note: If you encounter the error `RuntimeError: operator torchvision::nms does not exist` when using torchvision in the 2.2.0 Docker image, run the following command to fix the issue:

pip uninstall torch -y; pip install torch==2.2.0

Stable Features

PJRT

PJRT_DEVICE=GPU has been renamed to PJRT_DEVICE=CUDA (#5754).
- PJRT_DEVICE=GPU will be removed in the 2.3 release.
Optimize host-to-device transfer (#5772) and device-to-host transfer (#5825).
Miscellaneous low-level refactoring and performance improvements (#5799, #5737, #5794, #5793, #5546).
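
Scripts that still export the old value can be migrated with a small shim. The sketch below is a hypothetical helper (not part of PyTorch/XLA) that rewrites the deprecated value in an environment mapping; it would need to run before the runtime reads `PJRT_DEVICE`.

```python
from typing import MutableMapping


def normalize_pjrt_device(env: MutableMapping[str, str]) -> str:
    """Map the deprecated PJRT_DEVICE=GPU to PJRT_DEVICE=CUDA in `env`.

    Hypothetical helper for illustration. Returns the resulting device
    string, or an empty string if the variable is unset.
    """
    device = env.get("PJRT_DEVICE", "")
    if device == "GPU":
        device = "CUDA"
        env["PJRT_DEVICE"] = device
    return device


launch_env = {"PJRT_DEVICE": "GPU"}
print(normalize_pjrt_device(launch_env))  # CUDA
```

In a real launcher the mapping passed in would typically be `os.environ`.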

Beta Features

GSPMD

Support DTensor API integration and move GSPMD out of experimental (#5776).
Enable the debug visualization function visualize_tensor_sharding (#5742) and add documentation for it.
Support mark_sharding on scalar tensors (#6158).
Add apply_backward_optimization_barrier (#6157).

Export

Handle lifted constants in torch export (#6111).
Run decomp before processing (#5713).
Support export to tf.saved_model for models with unused params (#5694).
Add an option to not save the weights (#5964).
Experimental support for dynamic dimension sizes in torch export to StableHLO (#5790, openxla/xla#6897).

CoreAtenOpSet

PyTorch/XLA aims to support all PyTorch core ATen ops in the 2.3 release. We’re actively working on this; the remaining open issues can be found in the issue list.

Benchmark

Support of benchmark running automation and metric report analysis on both TPU and GPU (doc).

Experimental Features

FSDP via SPMD

Introduce FSDP via SPMD, or FSDPv2 (#6187). The RFC can be found in #6379.
Add FSDPv2 user guide (#6386).

Distributed Op

Support all-gather coalescing (#5950).
Support reduce-scatter coalescing (#5956).

Persistent Compilation

Enable persistent compilation caching (#6065).
Document and introduce xr.initialize_cache python API (#6046).
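
A minimal sketch of enabling the cache follows. It assumes that `xr` refers to `torch_xla.runtime` and that `initialize_cache` accepts a cache path plus a `readonly` flag; the wrapper name `enable_persistent_cache` and the directory are made up for the example. The import is guarded so the snippet degrades gracefully where PyTorch/XLA is not installed.

```python
import os
import tempfile


def enable_persistent_cache(cache_dir: str) -> bool:
    """Try to enable persistent compilation caching.

    Hypothetical wrapper: returns True if PyTorch/XLA is available and the
    cache was initialized, False if torch_xla cannot be imported.
    """
    try:
        import torch_xla.runtime as xr
    except ImportError:
        return False
    os.makedirs(cache_dir, exist_ok=True)
    # Assumed signature: initialize_cache(path, readonly=False).
    # This should be called before any computation is compiled.
    xr.initialize_cache(cache_dir, readonly=False)
    return True


cache_dir = os.path.join(tempfile.gettempdir(), "xla_compile_cache_example")
cache_enabled = enable_persistent_cache(cache_dir)
print("persistent cache enabled:", cache_enabled)
```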

Checkpointing

Support auto checkpointing for TPU preemption (#5753).
Support Async checkpointing through CheckpointManager (#5697).

Usability

Document Compilation/Execution analysis (#6039).
Add profiler API for async capture (#5969).

Quantization

Lower quant/dequant torch op to StableHLO (#5763).

GPU

Document multi-host GPU training (#5704).
Support multinode training via torchrun (#5657).

Bug Fixes and Improvements

Fix Pow precision issue (#6103).
Handle negative dim for Diagonal Scatter (#6123).
Fix as_strided for inputs smaller than the arguments specification (#5914).
Fix squeeze op lowering issue when dim is not in sorted order (#5751).
Optimize RNG seed dtype for better memory utilization (#5710).
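
Two of the fixes above concern how dimension arguments are interpreted: negative dims and dims given in unsorted order. The shape rule for squeezing a list of dims can be sketched in plain Python as follows; `squeeze_shape` is a hypothetical reference helper, not the XLA lowering.

```python
from typing import Sequence, Tuple


def squeeze_shape(shape: Sequence[int], dims: Sequence[int]) -> Tuple[int, ...]:
    """Return the shape after squeezing `dims`, accepting negative and unsorted dims."""
    ndim = len(shape)
    normalized = set()
    for dim in dims:
        if not -ndim <= dim < ndim:
            raise IndexError(f"dim {dim} is out of range for {ndim} dimensions")
        normalized.add(dim % ndim)  # map a negative dim to its positive index
    # Only requested dims of size 1 are removed; all others are kept.
    return tuple(size for axis, size in enumerate(shape)
                 if not (axis in normalized and size == 1))


print(squeeze_shape((1, 3, 1, 5), (2, 0)))  # (3, 5)
print(squeeze_shape((1, 3, 1, 5), (-2,)))   # (1, 3, 5)
```

Collecting the normalized dims into a set makes the result independent of the order in which the dims are passed.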

Lowering

_prelu_kernel_backward (#5724).

07 Sep 16:14

ManfeiBai

v2.1.0

8e9d27b

PyTorch/XLA 2.1 Release

Cloud TPUs now support the PyTorch 2.1 release, via PyTorch/XLA integration. On top of the underlying improvements and bug fixes in the PyTorch 2.1 release, this release introduces several features and PyTorch/XLA-specific bug fixes.

PJRT is now PyTorch/XLA's officially supported runtime! PJRT brings improved performance, superior usability, and broader device support. PyTorch/XLA r2.1 will be the last release with XRT available as a legacy runtime. Our main release build will not include XRT, but it will be available in a separate package. In most cases, we expect the migration to PJRT to require minimal changes. For more information, see our PJRT documentation.

GSPMD support has been added as an experimental feature to the PyTorch/XLA 2.1 release. GSPMD will transform the single-device program into a partitioned one with proper collectives, based on the user-provided sharding hints. This feature allows developers to write PyTorch programs as if they were on a single large device, without any custom sharded computation ops or collective communications to scale. We published a blog post explaining the technical details and expected usage; you can also find more detail in this user guide.
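
To make the idea of a sharding hint concrete, the sketch below computes the per-device shard shape for a tensor, given a device mesh shape and a partition spec. It assumes the partition spec lists, for each tensor dimension, the index of the mesh axis it is split over (or None to replicate), and that unevenly divisible dimensions are rounded up. It is a plain-Python model with a hypothetical function name, not the GSPMD partitioner.

```python
from typing import Optional, Sequence, Tuple


def local_shard_shape(tensor_shape: Sequence[int],
                      mesh_shape: Sequence[int],
                      partition_spec: Sequence[Optional[int]]) -> Tuple[int, ...]:
    """Shape of the shard each device holds under the given partition spec."""
    if len(partition_spec) != len(tensor_shape):
        raise ValueError("partition_spec needs one entry per tensor dimension")
    shard = []
    for dim_size, mesh_axis in zip(tensor_shape, partition_spec):
        if mesh_axis is None:
            shard.append(dim_size)  # replicated along this dimension
        else:
            parts = mesh_shape[mesh_axis]
            shard.append(-(-dim_size // parts))  # ceiling division
    return tuple(shard)


# An (8, 128) tensor on a 4 x 2 device mesh.
print(local_shard_shape((8, 128), (4, 2), (0, None)))  # (2, 128)
print(local_shard_shape((8, 128), (4, 2), (0, 1)))     # (2, 64)
```

The program itself is still written against the full (8, 128) tensor; only the annotation changes how the data is laid out across devices.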

PyTorch/XLA has transitioned from depending on TensorFlow to depending on the new OpenXLA repo. This allows us to reduce our binary size and simplify our build system. Starting from 2.1, PyTorch/XLA will publish its TPU wheel on PyPI.

To install PyTorch/XLA 2.1.0 wheels, please find the installation instructions below.

Installing PyTorch and PyTorch/XLA 2.1.0 wheel:

pip install torch~=2.1.0 torch_xla[tpu]~=2.1.0 -f https://storage.googleapis.com/libtpu-releases/index.html

Please note that you might have to re-install the libtpu on your TPUVM depending on your previous installation:

pip install torch_xla[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html

Stable Features

OpenXLA

Migrate from pulling XLA from TensorFlow to pulling it from OpenXLA, and sunset the TF pin dependency (#5202)
Instructions to build PyTorch/XLA with OpenXLA can be found in this doc.

PjRt Runtime

Move PJRT APIs from experimental to torch_xla.runtime (#5011)
Enable PJRT C API Client and other changes for Neuron (#5428)
Enable PJRT C API Client for Intel XPU (#4891)
Change pjrt:// init method to xla:// (#5560)
Make TPU detection more robust (#5271)
Add runtime.host_index (#5283)

Functionalization

Functionalization integration (#4158)
Add support for XLA_DISABLE_FUNCTIONALIZATION flag (#4792)

Improvements and additions

Op Lowering
- squeeze_copy.dims (#5286)
- native_dropout (#5643)
- native_dropout_backward (#5642)
- count_nonzero (#5137)
Build System
- Migrate the build system to Bazel (#4528)

Beta Features

AMP (Automatic Mixed Precision)

Added bfloat16 support on TPUs. (#5161)
Documentation can be found in amp.md

TorchDynamo

Support CPU eager fallback in Dynamo bridge (#5000)
Support torch.compile with SPMD for inference (#5002)
Update the dynamo backend name to openxla and openxla_eval (#5402)
Inference optimization for SPMD inference + torch.compile (#5447, #5446)

Traceable Collectives

Adopts traceable all_reduce (#4915)
Make xm.all_gather a single graph in Dynamo (#4922)

Experimental Features

GSPMD

Add SPMD user guide
Enable Input-output aliasing (#5320)
Introduce global_runtime_device_count to query the runtime device count (#5129)
Support partial replication (#5411)
Support tuple partition spec (#5488)
Support mark_sharding on IRs (#5301)
Make IR sharding custom sharding op (#5433)
Introduce Hybrid Device mesh creation (#5147)
Introduce SPMD-friendly patched nn.Linear (#5491)
Allow dumping post optimizations HLO (#5302)
Allow sharding n-d tensor on (n+1)-d Mesh (#5268)
Support synchronous distributed checkpointing (#5130, #5170)

Serving Support

SavedModel
- Added a script stablehlo-to-saved-model (#5493)
- Docs: https://github.com/pytorch/xla/blob/r2.1/docs/stablehlo.md#convert-saved-stablehlo-for-serving

StableHLO

Add StableHLO user guide (#5523)
Add save_as_stablehlo and save_torch_model_as_stablehlo APIs (#5493)
Make StableHLO executable (#5476)

Ongoing Development

TorchDynamo

Enable single step graph for training
Avoid inter-graph reshapes from aot_autograd
Support GSPMD for activation checkpointing

GSPMD

Support auto-sharding
Benchmark and improve GSPMD for XLA:GPU
Integrate with PyTorch’s Distributed Tensor API

GPU

Support Multi-host GPU for PJRT runtime
Improve performance on torchbench models

Quantization

Support PyTorch PT2E quantization workflow

Bug Fixes and Improvements

Fix unexpected Dynamo crash due to clear_pending_ir call (#5582)
Fix FSDP for Models with Frozen Weights (#5484)
Fix data type in Pow with Scalar base and Tensor exponent (#5467)
Fix the inplace op crash when applied on self tensors in dynamo (#5309)

12 Aug 07:23

miladm

v2.0.0

500e1c2

PyTorch/XLA 2.0 release

Cloud TPUs now support the PyTorch 2.0 release, via PyTorch/XLA integration. On top of the underlying improvements and bug fixes in PyTorch's 2.0 release, this release introduces several features and PyTorch/XLA-specific bug fixes.

Beta Features

PJRT runtime

Check out our newest document; PJRT is the default runtime in 2.0.
New Implementation of xm.rendezvous with XLA collective communication which scales better (#4181)
New PJRT TPU backend through the C-API (#4077)
Use PJRT by default if no runtime is configured (#4599)
Experimental support for torch.distributed and DDP on TPU v2 and v3 (#4520)

FSDP

Add auto_wrap_policy into XLA FSDP for automatic wrapping (#4318)

Stable Features

Lazy Tensor Core Migration

Migration is complete; check out this dev discussion for more detail.
Naively inherits LazyTensor (#4271)
Adopt even more LazyTensor interfaces (#4317)
Introduce XLAGraphExecutor (#4270)
Inherits LazyGraphExecutor (#4296)
Adopt more LazyGraphExecutor virtual interfaces (#4314)
Rollback to use xla::Shape instead of torch::lazy::Shape (#4111)
Use TORCH_LAZY_COUNTER/METRIC (#4208)

Improvements & Additions

Add an option to increase the worker thread efficiency for data loading (#4727)
Improve numerical stability of torch.sigmoid (#4311)
Add an api to clear counter and metrics (#4109)
Add met.short_metrics_report to display more concise metrics report (#4148)
Document environment variables (#4273)
Op Lowering
- _linalg_svd (#4537)
- Upsample_bilinear2d with scale (#4464)

Experimental Features

TorchDynamo (torch.compile) support

Check out our newest doc.
Dynamo bridge python binding (#4119)
Dynamo bridge backend implementation (#4523)
Training optimization: make execution async (#4425)
Training optimization: reduce graph execution per step (#4523)

PyTorch/XLA GSPMD on single host

Preserve parameter sharding with sharded data placeholder (#4721)
Transfer shards from server to host (#4508)
Store the sharding annotation within XLATensor (#4390)
Use d2d replication for more efficient input sharding (#4336)
Mesh to support custom device order (#4162)
Introduce virtual SPMD device to avoid unpartitioned data transfer (#4091)

Ongoing development

Ongoing Dynamic Shape implementation

Implement missing XLASymNodeImpl::Sub (#4551)
Make empty_symint support dynamism (#4550)
Add dynamic shape support to SigmoidBackward (#4322)
Add a forward pass NN model with dynamism test (#4256)

Ongoing SPMD multi host execution (#4573)

Bug fixes & improvements

Support int as index type (#4602)
Only alias inputs and outputs when force_ltc_sync == True (#4575)
Fix race condition between execution and buffer tear down on GPU when using bfc_allocator (#4542)
Release the GIL during TransferFromServer (#4504)
Fix type annotations in FSDP (#4371)

29 Nov 01:07

vanbasten23

v1.13.0

c62c5a5

PyTorch/XLA 1.13 release

Cloud TPUs now support the PyTorch 1.13 release, via PyTorch/XLA integration. The release has daily automated testing for the supported models: Torchvision ResNet, FairSeq Transformer and RoBERTa, HuggingFace GLUE and LM, and Facebook Research DLRM.

On top of the underlying improvements and bug fixes in PyTorch's 1.13 release, this release adds several features and PyTorch/XLA-specific bug fixes.

New Features

GPU enhancement
- Add upsample_nearest/bilinear implementation for CPU and GPU (#3990)
- Set three_fry as the default RNG for GPU (#3951)
FSDP enhancement
- allow FSDP wrapping and sharding over modules on CPU devices (#3992)
- Support param sharding dim and pinning memory (#3830)
Lower torch::einsum using xla::einsum, which provides a significant speedup (#3843)
Support large models with more than 3200 graph inputs on TPU + PJRT (#3920)

Experimental Features

PJRT experimental support on Cloud TPU v4
- See the instructions and example code here
DDP experimental support on Cloud TPU and GPU
- See the instructions, analysis and example code here

Ongoing development

Ongoing Dynamic Shape implementation (POC completed)
Ongoing SPMD implementation (POC completed)
Ongoing LTC migration

Bug fixes and improvements

Make XLA_HLO_DEBUG populate the scope metadata (#3985)

29 Jun 01:22

wonjoolee95

v1.12.0

82fbe57

PyTorch/XLA 1.12 release

Cloud TPUs now support the PyTorch 1.12 release, via PyTorch/XLA integration. The release has daily automated testing for the supported models: Torchvision ResNet, FairSeq Transformer and RoBERTa, HuggingFace GLUE and LM, and Facebook Research DLRM.

On top of the underlying improvements and bug fixes in PyTorch's 1.12 release, this release adds several features and PyTorch/XLA-specific bug fixes.

New feature

FSDP
- See the instructions and example code here
- FSDP support for PyTorch/XLA (#3431)
- bfloat16 and float16 support in FSDP (#3617)
PyTorch/XLA gradient checkpoint API (#3524)
Optimization_barrier which enables gradient checkpointing (#3482)
Ongoing LTC migration
Device lock position optimization to speed up tracing (#3457)
Experimental support for PJRT TPU client (#3550)
Send/Recv CC op support (#3494)
Performance profiling tool enhancement (#3498)
TPU-V4 pod official support (#3440)
Roll lowering (#3505)
Celu, celu_, selu, selu_ lowering (#3547)

Bug fixes and improvements

Fixed a view bug which will create unnecessary IR graph (#3411)

15 Mar 23:46

miaoshasha

v1.11.0

3b12115

PyTorch/XLA 1.11 release

Cloud TPUs now support the PyTorch 1.11 release, via PyTorch/XLA integration. The release has daily automated testing for the supported models: Torchvision ResNet, FairSeq Transformer and RoBERTa, HuggingFace GLUE and LM, and Facebook Research DLRM.

On top of the underlying improvements and bug fixes in PyTorch's 1.11 release, this release adds several features and PyTorch/XLA-specific bug fixes.

New feature

Enable asynchronous RNG seed sending by environment variable XLA_TRANSFER_SEED_ASYNC
Add a native torch.distributed backend
Introduce an eager debug mode by environment variable XLA_USE_EAGER_DEBUG_MODE
Add synchronous-free Adam and AdamW optimizers for PyTorch/XLA:GPU AMP
Add synchronous-free SGD optimizers for PyTorch/XLA:GPU AMP
linspace lowering
mish lowering
prelu lowering
slogdet lowering
stable sort lowering
index_add with alpha scaling lowering

Bug fixes & improvements

Improve torch.var performance and numerical stability on TPU
Improve torch.pow performance
Fix the incorrect output dtype when dividing an f32 by an f64
Fix the incorrect result of nll_loss when reduction = "mean" and the whole target is equal to ignore_index
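
The nll_loss edge case is easier to see in a small reference model. With mean reduction the loss is averaged over the non-ignored targets only, so when every target equals ignore_index the average is 0/0. The plain-Python sketch below returns NaN in that case, on the assumption that this mirrors eager PyTorch; the function name and the default of -100 for ignore_index are part of the sketch, not taken from the release notes.

```python
import math
from typing import Sequence


def nll_loss_mean(log_probs: Sequence[Sequence[float]],
                  targets: Sequence[int],
                  ignore_index: int = -100) -> float:
    """Mean negative log-likelihood over the targets that are not ignored."""
    total = 0.0
    count = 0
    for row, target in zip(log_probs, targets):
        if target == ignore_index:
            continue  # ignored samples add nothing to the sum or the count
        total -= row[target]
        count += 1
    # With every target ignored this is 0/0, reported here as NaN.
    return total / count if count else math.nan


log_probs = [[math.log(0.5), math.log(0.5)],
             [math.log(0.25), math.log(0.75)]]
partly_ignored = nll_loss_mean(log_probs, [0, -100])
all_ignored = nll_loss_mean(log_probs, [-100, -100])
print(partly_ignored, all_ignored)
```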

25 Oct 17:10

miaoshasha

v1.10.0

8fb44f9

PyTorch/XLA 1.10 release

Cloud TPUs now support the PyTorch 1.10 release, via PyTorch/XLA integration. The release has daily automated testing for the supported models: Torchvision ResNet, FairSeq Transformer and RoBERTa, HuggingFace GLUE and LM, and Facebook Research DLRM.

On top of the underlying improvements and bug fixes in PyTorch's 1.10 release, this release adds several features and PyTorch/XLA-specific bug fixes:

Add support for reduce_scatter
Introduce the AMP Zero gradients optimization for XLA:GPU
Introduce the environment variables XLA_DOWNCAST_BF16 and XLA_DOWNCAST_FP16 to downcast input tensors
adaptive_max_pool2d lowering
nan_to_num lowering
sgn lowering
logical_not/logical_xor/logical_or/logical_and lowering
amax lowering
amin lowering
std_mean lowering
var_mean lowering
lerp lowering
isnan lowering

04 Mar 23:45

zcain117

v1.8.0

f2f8f44

PyTorch/XLA 1.8 release

Summary

Cloud TPUs now support the PyTorch 1.8 release, via PyTorch/XLA integration. The release has daily automated testing for the supported models: Torchvision ResNet, FairSeq Transformer and RoBERTa, HuggingFace GLUE and LM, and Facebook Research DLRM.

This release focused on making PyTorch XLA easier to use and debug. See below for a list of new features.

New Features

Enhanced usability:
- Profiler tools to help you pinpoint the areas where you can improve the memory usage or speed of your TPU models. The tools are ready to use; check out our main README for some upcoming tutorials.
- Simpler error messages (#2771)
- Less log spam using TPU Pods (#2662)
- Able to view images in TensorBoard (#2679)
TriangularSolve (#2498) (example)
New ops supported by PyTorch/XLA:
- random_ (#2617)
- adaptive_avg_pool3d (#2616)
- UpsampleNearest2D (#2597)

Bug Fixes

Crashing while using dynamic shapes (#2602)
all_to_all crashing on TPU pods (#2601)
SiLU fix (#2721)

19 Aug 21:19

jysohn23

v1.6.0

9703109

PyTorch/XLA 1.6 Release (GA)

Highlights

Cloud TPUs now support the PyTorch 1.6 release, via PyTorch/XLA integration. With this release we mark our general availability (GA), with models such as ResNet, FairSeq Transformer and RoBERTa, and HuggingFace GLUE task models having been rigorously tested and optimized.

In addition, with our PyTorch/XLA 1.6 release, you no longer need to run the env-setup.py script on Colab/Kaggle as those are now compatible with native torch wheels. See here for an example of the new Colab/Kaggle install step. You can still continue to use that script if you would like to run with our latest unstable releases.

New Features

XLA RNG state checkpointing/loading (#2096)
Device Memory XRT API (#2295)
[Kaggle/Colab] Small host VM memory environment utility (#2025)
[Advanced User] XLA Builder Support (#2125)
New ops supported on PyTorch/XLA
- Hardsigmoid (#1940)
- true_divide (#1782)
- max_unpool2d (#2188)
- max_unpool3d (#2188)
- Replication_pad1d (#2188)
- Replication_pad2d (#2188)
Dynamic shape support on XLA:CPU and XLA:GPU (experimental)

Bug Fixes

RNG Fix (proper randomness with bernoulli and dropout) (#1932)
Manual all-reduce in backward pass (#2325)
