
PyTorch 1.8 Release, including Compiler and Distributed Training updates, New Mobile Tutorials and more

Released by @albanD · 04 Mar 20:44 · 37c1f4a

PyTorch 1.8.0 Release Notes

  • Highlights
  • Backwards Incompatible changes
  • Deprecations
  • New Features
  • Improvements
  • Bug Fixes
  • Performance
  • Documentation

Highlights

We are excited to announce the availability of PyTorch 1.8. This release is composed of more than 3,000 commits since 1.7. It includes major updates and new features for compilation, code optimization, frontend APIs for scientific computing, and AMD ROCm support through binaries that are available via pytorch.org. It also provides improved features for large-scale training for pipeline and model parallelism, and gradient compression. A few of the highlights include:

  1. Support for doing Python-to-Python functional transformations via torch.fx;
  2. Added and stabilized APIs to support FFTs (torch.fft) and linear algebra functions (torch.linalg), added autograd support for complex tensors, and updates to improve performance for calculating Hessians and Jacobians; and
  3. Significant updates and improvements to distributed training, including improved NCCL reliability, pipeline parallelism support, RPC profiling, and support for communication hooks for gradient compression. See the full release notes here.

Along with 1.8, we are also releasing major updates to PyTorch libraries including TorchCSPRNG, TorchVision, TorchText and TorchAudio. For more on the library releases, see the post here. As previously noted, features in PyTorch releases are classified as Stable, Beta and Prototype. You can learn more about the definitions in the post here.

You can find more details on all the highlighted features in the PyTorch 1.8 Release blogpost.

Backwards Incompatible changes

Fix Tensor inplace modulo in python (#49390)

In-place modulo in Python, %=, was wrongfully done out of place for Tensors. This change fixes the behavior.
Code that relied on this operation being done out of place should be updated to use the out-of-place version t = t % other instead of t %= other.

1.7.1:

>>> a = torch.arange(0, 10)
>>> b = a
>>> b %= 3
>>> print(a)
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> print(b)
tensor([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])

1.8.0:

>>> a = torch.arange(0, 10)
>>> b = a
>>> b %= 3
>>> print(a)
tensor([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
>>> print(b)
tensor([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])

Standardize torch.clamp edge cases (#43288)

For ease of exposition let a_min be the value of the "min" argument to clamp, and a_max be the value of the "max" argument to clamp.

This PR changes the behavior of torch.clamp to always compute min(max(a, a_min), a_max). torch.clamp previously computed this in its vectorized CPU implementation but used different approaches for other backends.
These implementations are the same when a_min < a_max, but diverge when a_min > a_max. This divergence is easily triggered:

>>> t = torch.arange(200).to(torch.float)
>>> torch.clamp(t, 4, 2)[0]
tensor(2.)

>>> torch.clamp(t.cuda(), 4, 2)[0]
tensor(4., device='cuda:0')

>>> torch.clamp(torch.tensor(0), 4, 2)
tensor(4)

This PR makes the behavior consistent with NumPy's clip. C++'s std::clamp's behavior is undefined when a_min > a_max. Python has no standard clamp implementation.
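
For reference, under the new rule the CPU, CUDA, and scalar cases above all produce the same result. A minimal sketch; the output shown is the value expected from min(max(a, a_min), a_max):

>>> t = torch.arange(200).to(torch.float)
>>> torch.clamp(t, 4, 2)[0]  # min(max(0., 4), 2) on every backend
tensor(2.)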

Tensor deepcopy now properly copies the .grad field (#50663)

The deepcopy protocol will now properly copy the .grad field of Tensors when it exists.
The old behavior can be recovered by setting the .grad field to None after doing the deepcopy.

1.7.1:

>>> t.grad
tensor([0.8883, 0.5765])
>>> deepcopy(t).grad
None

1.8.0:

>>> t.grad
tensor([0.8883, 0.5765])
>>> deepcopy(t).grad
tensor([0.8883, 0.5765])

Fix torch.fmod type promotion (#47323, #48278)

1.7.1:
Raises a RuntimeError for an integral tensor and a floating-point tensor.
The dtype of the output is determined by the first input.

>>> x = torch.arange(start=1, end=6, dtype=torch.int32) # tensor([1, 2, 3, 4, 5])
>>> y = torch.arange(start=1.1, end=2.1, step=0.2, dtype=torch.float32) # tensor([1.1, 1.3, 1.5, 1.7, 1.9])
>>> torch.fmod(x, y)
RuntimeError: result type Float can't be cast to the desired output type Int
>>> z = torch.arange(start=0.2, end=1.1, step=0.2, dtype=torch.float64) # tensor([0.2, 0.4, 0.6, 0.8, 1.], dtype=torch.float64)
>>> torch.fmod(y, z).dtype
torch.float32
>>> torch.fmod(z, y).dtype
torch.float64
>>> torch.fmod(x, 1.2)
tensor([0, 0, 0, 0, 0], dtype=torch.int32)

1.8.0:
Supports an integral tensor and a floating-point tensor as inputs.
The dtype of the output is determined by both inputs.

>>> x = torch.arange(start=1, end=6, dtype=torch.int32) # tensor([1, 2, 3, 4, 5])
>>> y = torch.arange(start=1.1, end=2.1, step=0.2, dtype=torch.float32) # tensor([1.1, 1.3, 1.5, 1.7, 1.9])
>>> torch.fmod(x, y)
tensor([1.0000, 0.7000, 0.0000, 0.6000, 1.2000])
>>> z = torch.arange(start=0.2, end=1.1, step=0.2, dtype=torch.float64) # tensor([0.2, 0.4, 0.6, 0.8, 1.], dtype=torch.float64)
>>> torch.fmod(y, z).dtype
torch.float64
>>> torch.fmod(z, y).dtype
torch.float64
>>> torch.fmod(x, 1.2)
tensor([1.0000, 0.8000, 0.6000, 0.4000, 0.2000])

Preserve non-dense or overlapping tensor's layout in *_like functions (#46046)

All the *_like factory functions will now generate the same striding as out-of-place operations would.
This means in particular that non-contiguous tensors will produce non-contiguous outputs.
If you require a contiguous output, you can pass the memory_format=torch.contiguous_format keyword argument to the factory function. Such factory functions include clone, to, float, cuda, *_like, zeros, rand{n}, etc.
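
As an illustration, a minimal sketch of the new behavior, assuming a channels-last tensor as the non-contiguous example:

>>> x = torch.empty(2, 3, 4, 5).to(memory_format=torch.channels_last)
>>> torch.empty_like(x).is_contiguous()  # the input's striding is now preserved
False
>>> torch.empty_like(x, memory_format=torch.contiguous_format).is_contiguous()
True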

Make output of torch.norm and torch.linalg.norm consistent for complex inputs (#48284)

Previously, when given a complex input, torch.linalg.norm and torch.norm would return a complex output. torch.linalg.cond would sometimes return a complex output and sometimes return a real output when given a complex input, depending on its p argument. This PR changes this behavior to match numpy.linalg.norm and numpy.linalg.cond, so that a complex input will result in a real number type, consistent with NumPy.
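
For example, a small sketch of the new dtype behavior (only the output dtype matters here; the input values are arbitrary):

>>> t = torch.tensor([1 + 1j, -2j])  # complex64 input
>>> torch.linalg.norm(t).dtype       # real-valued output in 1.8.0
torch.float32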

Make torch.svd return V, not V.conj() for complex inputs (#51012)

torch.svd added support for complex inputs in PyTorch 1.7, but was not documented as doing so. The complex V tensor returned was actually the complex conjugate of what's expected. This PR fixes the discrepancy.
Users that were already using the previous version of torch.svd with complex inputs can recover the previous behavior by taking the complex conjugate of the returned V.
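
A minimal sketch of that workaround, using an arbitrary complex input a:

>>> a = torch.randn(3, 3, dtype=torch.complex64)
>>> U, S, V = torch.svd(a)
>>> V_as_in_1_7 = V.conj()  # reproduces the tensor torch.svd returned for complex inputs in 1.7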

torch.angle: properly handle pure real numbers (#49163)

This PR updates PyTorch's torch.angle operator to be consistent with NumPy's. Previously torch.angle would return zero for all real inputs (including NaN). Now angle returns pi for negative real inputs, zero for non-negative real inputs, and propagates NaNs.
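
For example, the new behavior on pure real inputs (printed formatting is approximate):

>>> torch.angle(torch.tensor([-2.0, 0.0, 3.0, float('nan')]))
tensor([3.1416, 0.0000, 0.0000, nan])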

Enable distribution validation by default for torch.distributions (#48743)

This may slightly slow down some models. Concerned users may disable validation by using torch.distributions.Distribution.set_default_validate_args(False) or by disabling individual distribution validation via MyDistribution(..., validate_args=False).

This may cause new ValueErrors in models that rely on unsupported behavior, e.g. Categorical.log_prob() applied to continuous-valued tensors (only {0,1}-valued tensors are supported).
Such models should be fixed but the previous behavior can be recovered by disabling argument validation using the methods mentioned above.
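
A minimal sketch of both opt-out mechanisms (Categorical is used here purely as an example distribution):

>>> import torch
>>> from torch import distributions as D
>>> # globally, for all distributions
>>> D.Distribution.set_default_validate_args(False)
>>> # or per instance
>>> d = D.Categorical(probs=torch.tensor([0.3, 0.7]), validate_args=False)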

Prohibit assignment to a sparse tensor (#50040)

Assigning to a sparse Tensor did not work properly and resulted in a no-op. The following code now properly raises an error:

>>> t = torch.rand(10).to_sparse()
>>> t[0] = 42
TypeError: Cannot assign to a sparse tensor

C++ API: operators that take a list of optional Tensors cannot be called with ArrayRef<Tensor> anymore (#49138)

This PR changes the C++ API representation of lists of optional Tensors (e.g. in the Tensor::index method) from ArrayRef<Tensor> to List<optional<Tensor>>. This change breaks backwards compatibility, since there is no implicit conversion from ArrayRef<Tensor> to List<optional<Tensor>>.

A common call pattern is tensor.index({indices_tensor}), where indices_tensor is a Tensor. This will continue to work because the {} initializer_list constructor for List<optional<Tensor>> can take Tensor elements that are implicitly converted to optional<Tensor>.

However, another common call pattern is tensor.index(indices_tensor), where previously the Tensor got implicitly converted to an ArrayRef<Tensor>. To implicitly convert Tensor -> optional<Tensor> -> List<optional<Tensor>> would chain two implicit conversions, which C++ doesn't allow. So those call sites should be rewritten to use the tensor.index({indices_tensor}) pattern.

Autograd view creation information is now properly propagated when views are chained

After this fix, an error is properly thrown to avoid wrong gradients when an in-place operation is performed on a view of a view, if in-place operations were not allowed on the first view.
This means that code that used to return wrong gradients in 1.7.1 (such as t.unbind()[0].select(0, 0).add_(1)) will now properly raise an error.
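
A minimal sketch of such code, which silently computed wrong gradients in 1.7.1 and now raises (error message elided):

>>> t = torch.rand(2, 3, requires_grad=True)
>>> t.unbind()[0].select(0, 0).add_(1)  # in-place op on a view of a view of t
RuntimeError: ...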

End of deprecation cycle for spectral ops in the torch. namespace (#48594)

This PR removes the deprecated torch.{fft,rfft,ifft,irfft} and their corresponding methods on torch.Tensor. PyTorch programs using these functions must now update to use the torch.fft namespace.
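
A hedged migration sketch: the torch.fft functions return complex tensors rather than the (..., 2) real layout of the removed operators, so code that needs something close to the old layout can combine them with torch.view_as_real:

>>> x = torch.randn(16)
>>> spec = torch.fft.rfft(x)          # complex output, replaces the removed torch.rfft
>>> pairs = torch.view_as_real(spec)  # (..., 2) real layout similar to the old output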

torch.digamma : properly handle all inputs (#48302)

This PR updates PyTorch's torch.digamma function to be consistent with SciPy's special.digamma function. This changes the result of the torch.digamma function on the nonpositive integers, where the gamma function is not defined. Since the gamma function is undefined at these points, the (typical) derivative of the logarithm of the gamma function is also undefined at these points, and for negative integers this PR updates torch.digamma to return NaN. For zero, however, it returns -inf to be consistent with SciPy.

Interestingly, SciPy made a similar change, which was noticed by at least one user: scipy/scipy#9663

SciPy's returning of negative infinity at zero is intentional:
https://github.com/scipy/scipy/blob/59347ae8b86bcc92c339efe213128f64ab6df98c/scipy/special/cephes/psi.c#L163

This change is consistent with the C++ standard for the gamma function:
https://en.cppreference.com/w/cpp/numeric/math/tgamma
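
For example, the values now returned at nonpositive integers:

>>> torch.digamma(torch.tensor([0., -1., -2.]))
tensor([-inf, nan, nan])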

Fix torch.remainder type promotion (#48668)

1.7.1:
When the second argument is a Python number, the result is cast to the dtype of the first argument.

>>> x = torch.arange(start=1, end=6, dtype=torch.int32)  # tensor([1, 2, 3, 4, 5])
>>> torch.remainder(x, 1.2)
tensor([0, 0, 0, 0, 0], dtype=torch.int32)

1.8.0:
When the second argument is a Python number, the dtype of the result is determined by type promotion of both inputs.

>>> x = torch.arange(start=1, end=6, dtype=torch.int32)  # tensor([1, 2, 3, 4, 5])
>>> torch.remainder(x, 1.2)
tensor([1.0000, 0.8000, 0.6000, 0.4000, 0.2000])

Changes to onnx export API to better handle named arguments (#47367)

The args input argument of the torch.onnx.export function has been updated to better support optional arguments. An optional dictionary can now be passed as the last element of the args tuple, specifying inputs with their corresponding named parameters. Note that this is backwards-incompatible for cases where the last input is itself of dictionary type: in the new API, such cases must pass an empty dictionary as the last element of the args tuple.
More details can be found at: https://pytorch.org/docs/1.8.0/onnx.html?highlight=onnx#using-dictionaries-to-handle-named-arguments-as-model-inputs.
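
A hedged sketch of the new calling convention, using a toy module and output file name purely for illustration:

import torch
from torch import nn

class M(nn.Module):
    def forward(self, x, y=None):
        # one required positional input and one optional named input
        return x if y is None else x + y

m = M()
x, y = torch.randn(2, 3), torch.randn(2, 3)

# Named arguments are supplied through an optional dict placed last in the args tuple.
torch.onnx.export(m, (x, {"y": y}), "model.onnx")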

Update signature of torch.quantization.quantize function (#48537)

The run_args argument must now be a list or tuple containing the positional arguments, even if there is only a single argument.
In particular, code like qmodel = quantize(float_model, default_eval_fn, img_data) that was working in 1.7.1 will now raise the error: TypeError: default_eval_fn() takes 2 positional arguments but 3 were given.
You should update this code to provide the image data in a list, for example: qmodel = quantize(float_model, default_eval_fn, [img_data])

Change the way we quantize relu, leaky relu and sigmoid (#47415, #48038, #45702, #45711, #45883, #45882, #47660)

Starting with version 1.8.0, in the eager mode quantization flow, relu is no longer observed, as observation is not needed for it.
In previous versions, quantized leaky_relu and sigmoid did not require observation and simply inherited the quantization parameters from their input, but that does not work very well in eager mode quantization. Starting with version 1.8.0, they are observed operators so that they work better in eager mode quantization.

Update direction numbers to 21201 dims in the SobolEngine (#49710)

This update is BC-breaking because the values drawn by the engine will be different from the ones drawn in 1.7.1 even with the same seed.

1.7.1:

>>> from torch.quasirandom import SobolEngine
>>> eng = SobolEngine(1)
>>> eng.draw(3)
tensor([[0.5000],
        [0.7500],
        [0.2500]])

1.8.0:

>>> from torch.quasirandom import SobolEngine
>>> eng = SobolEngine(1)
>>> eng.draw(3)
tensor([[0.0000],
        [0.5000],
        [0.7500]])

Deprecations

Python API

Deprecate old style nn.Module backward hooks (#46163)

Old-style nn.Module backward hooks have been broken for a long time (they do not behave as advertised in the documentation). The new nn.Module.register_full_backward_hook provides a fully working implementation of these hooks.
The old function should no longer be used; code using it should migrate to the new full version.

An example of this discrepancy is shown below, where a Linear layer takes as input a single Tensor of size 5 and returns a single Tensor of size 5, but the old-style hook reports two gradients with respect to the input even though there is only one input.

1.7.1:

import torch
from torch import nn

mod = nn.Linear(5, 5)
def hook(mod, grad_inp, grad_out):
    print(f"grad input size: " + " ".join(str(g.size()) for g in grad_inp))
    print(f"grad output size: " + " ".join(str(g.size()) for g in grad_out))
mod.register_backward_hook(hook)

mod(torch.rand(5, requires_grad=True)).sum().backward()
>>> grad input size: torch.Size([5]) torch.Size([5]) # One too many
>>> grad output size: torch.Size([5])

1.8.0:
Old-style hooks are deprecated and will warn when they provide a wrong result.

import torch
from torch import nn

mod = nn.Linear(5, 5)
def hook(mod, grad_inp, grad_out):
    print(f"grad input size: " + " ".join(str(g.size()) for g in grad_inp))
    print(f"grad output size: " + " ".join(str(g.size()) for g in grad_out))
mod.register_backward_hook(hook)

mod(torch.rand(5, requires_grad=True)).sum().backward()
>>> grad input size: torch.Size([5]) torch.Size([5]) # One too many
>>> grad output size: torch.Size([5])
>>> UserWarning: Using a non-full backward hook when the forward contains multiple
autograd Nodes is deprecated and will be removed in future versions. This hook
will be missing some grad_input.

Full hooks should be used to get the proper result all the time and to avoid warnings:

mod.register_full_backward_hook(hook)

mod(torch.rand(5, requires_grad=True)).sum().backward()
>>> grad input size: torch.Size([5])
>>> grad output size: torch.Size([5])

torch.stft: Deprecate default value of the return_complex argument (#49022, #50102)

Previously, torch.stft took an optional return_complex parameter that indicated whether the output would be a real tensor or a complex tensor. return_complex defaults to False. This default value is deprecated (meaning that this optional argument is becoming mandatory) and will be removed in future versions. You can pass this argument explicitly to avoid the deprecation warning.
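
A minimal sketch of passing the argument explicitly (the n_fft value is arbitrary):

>>> x = torch.randn(1024)
>>> spec = torch.stft(x, n_fft=256, return_complex=True)
>>> spec.dtype
torch.complex64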

Deprecate torch.set_deterministic in favor of torch.use_deterministic_algorithms (#49904)

This beta feature is being renamed for improved clarity. Users should migrate to use the new name.
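
The migration is a direct rename, for example:

>>> torch.use_deterministic_algorithms(True)  # replaces the deprecated torch.set_deterministic(True)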

Deprecate torch.* linear algebra functions in favor of the torch.linalg.* variant for cholesky (#51460), slogdet (#51354), inverse (#51672), pinverse (#51671)

All the linear algebra functions are being moved to the torch.linalg submodule, which provides a NumPy-compatible API. These new functions have the same set of features as the torch.* ones and should be used instead.
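
For example, hedged one-to-one replacements for the deprecated calls (a is an arbitrary symmetric positive-definite example so that cholesky applies):

>>> a = torch.randn(3, 3)
>>> a = a @ a.T + 3 * torch.eye(3)              # make a symmetric positive-definite
>>> L = torch.linalg.cholesky(a)                # instead of torch.cholesky(a)
>>> sign, logabsdet = torch.linalg.slogdet(a)   # instead of torch.slogdet(a)
>>> a_inv = torch.linalg.inv(a)                 # instead of torch.inverse(a)
>>> a_pinv = torch.linalg.pinv(a)               # instead of torch.pinverse(a)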

New features

Python API

  • New functions (most of them to improve numpy compatibility): torch.nan_to_num (#44592), torch.tensor_split (#45168), torch.nanmedian (#45847), torch.ravel (#46098), torch.igamma (#46183), torch.igammac (#48171), torch.{column_stack,row_stack} (#46313), torch.kron (#45358), torch.copysign (#46396), Tensor.new_empty_strided (#47225), torch.{swapdims,swapaxes} (#46041), torch.tile (#47974), torch.float_power (#44937), torch.moveaxis (#48581), torch.inner (#46716), torch.msort (#48440), torch.sinc (#48740), torch.broadcast_to (#48997), torch.xlogy (#48777), torch.f{max,min} (#49312), torch.diff (#50569), torch.ldexp (#45370), torch.broadcast_shapes (#43935),
  • torch.fft new features: 2D FFT functions (#45164), use new FFT operators in stft (#47601), helper functions (#44877), fuzzing benchmark (#47872)
  • torch.linalg new features: linalg.tensorsolve (#46142), linalg.cholesky (#46083), linalg.tensorinv (#45969), linalg.{eigh,eigvalsh} (#45526), linalg.matrix_rank (#48206), linalg.solve (#48456), linalg.qr (#47764, #50046), linalg.svd (#45562), linalg.inv (#48261), linalg.pinv (#48399), linalg.slogdet (#49194), linalg.cond (#45832)
  • New torch.nn Modules: nn.PixelUnshuffle (#49334), nn.GaussianNLLLoss (#50886)
  • Automatic shape inference in torch.nn: new nn.LazyLinear (#44538), nn.LazyConv{1,2,3}d and nn.LazyConvTranspose{1,2,3}d (#47350)
  • Add channels last support for torch.nn.AdaptiveAvgPool2d (#48916)
  • Add option to produce standalone executable with cpp_extensions (#47862)
  • Add sparse-sparse matrix multiplication support (#39526)
  • Add torch.futures.Future.add_done_callback (#45675)
  • Add three_phase optional argument to torch.optim.lr_scheduler.OneCycleLR (#42715)
  • Add bicubic option for the mode argument of torch.nn.functional.grid_sample (#44780)
  • Add new distributions to torch.distributions: Kumaraswamy (#48285), LKJCholesky (#48798)
  • Add reparameterization support to torch.distributions.OneHotCategorical (#46610)
  • Add new transforms to torch.distributions: CorrCholeskyTransform (#48041)
  • Add new constraint to torch.distributions: independent (#50547, #50302)
  • Add zero annealing epochs to SWA optimizer (#47579)
  • Add close method to torch.hub.tqdm mock (#46040)
  • Add support for pruning based on custom importance scores via the importance_scores keyword argument (#48378)
  • Add torch vitals (#51047)

Complex Numbers

  • Complex Number support on CPU and CUDA for torch.symeig (#45121), torch.pinverse (#45819), torch.det (#45980), torch.diagflat (#47564), torch.{addcmul, addcdiv} (#46639), torch.lu_solve (#48028), torch.matrix_exp (#48363), torch.eig (#49168), torch.{acosh, asinh, atanh} (#50387), torch.masked_scatter (#51281), torch.bmm and torch.baddbmm (#42553), torch.orgqr (#50502), torch.index_fill_ (#50578), torch.cholesky_inverse (#50269)
  • Complex Number support on CUDA for torch.qr (#45032), torch.lu (#45898), torch.prod(#45980), torch.triangular_solve (#46916), torch.solve (#47045), torch.cholesky_solve (#47047), torch.mean (#47048), torch.svd (#45795), torch.inverse (#47595), torch.Tensor.index_put_ (#51148)
  • Complex Number support on CPU for torch.trace (#50380)
  • Complex Number support for torch.nn.DataParallel (#48686), torch.nn.L1Loss (#49912), Padding functions (#50594)
  • Complex Number support for torch.distributed.{all_reduce, all_gather} (#45879, #46270)
  • Complex Autograd support for torch.{atan, log, log10, log1p, log2, reciprocal, tan, pow, rsqrt, tanh, asinh, acosh} (#46275), torch.{cholesky, triangular_solve, mm, mv, ger} (#45737), torch.take(), torch.Tensor.fill_() (#46860), torch.matrix_exp (#48363), torch.{baddbmm, addbmm, addmm, addmv} (#50632), torch.qr (#48489), torch.svd and torch.pinverse (#47761), torch.sqrt (#49461), torch.diag (#51268), torch.trace (#51537), torch.exp (#47194), torch.mean (#47566), torch.addr (#50667), torch.{stack, gather, index_select}, torch.Tensor.index_add_(#49552), torch.{masked_scatter, masked_select} (#51281), torch.{addcmul, addcdiv} (#46639), torch.{acosh, asinh, atanh} (#50387), torch.solve (#47045), torch.cholesky_solve (#47047), torch.inverse (#47595)
  • Add complex autograd support for named tensors (#47289)
  • Allow converting parameters and buffers of torch.nn.Module to complex dtypes (#44788)
  • Add complex support to IValues (#50883, #51476)
  • Add TorchScript type annotation logic for complex numbers (#50884)
  • Add serialization logic for complex numbers (#51287)
  • Add support for complex number lists in JIT (#51145)
  • Add support for complex valued keys for dict in TorchScript (#51472)
  • Add scalar.conj() (#46596)
  • Add Tensor.copy_() for ComplexHalf tensors (#45339)

Profiler

  • New profiler API (#48280)
  • Use libkineto in profiler (#46470)
  • Add FLOPS computation support to the new profiler API (#51734)
  • Add high level profiling trace for dataloading and optimizer (#47655)
  • Add support for SVG visualization (#48438)

Autograd

  • Add inputs argument to autograd.backward() both in python and c++ (#46855, #47214)
  • Add support for Tensor-like objects in torch.autograd.gradcheck (#45732)
  • Add experimental vectorize flag to torch.autograd.functional.{jacobian, hessian} (#50915, #51638)
  • Add anomaly mode in C++ API (#46981, #47164)
  • Make torch.lu differentiable. (#46284)
  • Add support for generators in autograd decorators like torch.no_grad (#49017)

Dataloader

CUDA

  • Allow user to specify a fraction of the GPU memory with set_per_process_memory_fraction. (#48172)
  • CUDA BFloat16 TopK (#44755)
  • Add LazyNVRTC (#45674)
  • Enable CUDA Fuser for ROCm (#45965)
  • Define the record_stream method in native_functions.yaml (#44301)
  • Add CUDA 11.1 docker build (#46283)
  • Add nvtx.range() context manager (#42925)
  • CUDA BFloat16 gelu, hardswish, hardsigmoid (#44997)
  • [ROCm] enable stream priorities (#47136)
  • Add bfloat support for torch.randn and torch.norm (#47143)
  • CUDA BFloat16 Dropout (#45005), batchnorm (non-cuDNN) (#44994), backwards (#48809), sparse (#48807), indexing (#48801), embedding (#44848), signal windows (#45155), norm (#48806), isinf and isfinite (#49356), gemms on arch other than ampere (#50442), clamp, remainder, lshift, rshift (#45247)
  • Make CUDAGeneratorImpl capturable (#48694)
  • Adding support for CuDNN-based LSTM with projections (#47725)
  • Add torch.cuda.can_device_access_peer (#50446)
  • Add torch::cuda::nccl::all2all (#45900)

C++ API

  • Add distance-agnostic triplet margin loss (#45377)
  • Add torch::nn::ModuleDict (#47707)
  • Add torch::cuda::synchronize (#50072)
  • Add new XPU backend type for Intel heterogeneous computation platform. (#49786)

TorchScript

  • torch::jit::freeze C++ api introduced (#52337, #52392)
  • Add API for ignoring arbitrary module attributes during compilation (#45262)
  • Support tracing tensor __setitem__ with dynamic shape (#45828)
  • Expose script_if_tracing as public API (#46494)
  • Support %-based string formatting (#45976)
  • Add torch.jit.isinstance support for typed containers (#46062)
  • Allow for source code comments at any level of indentation (#46548)
  • Support hashing of various data types by implementing generic hashing for IValues (#46441)
  • Support doc string for TorchBind custom classes (#46576)
  • Add API for selective lowering of modules to custom JIT backend (#43613)
  • add list() support (#42382)
  • Support using lambda function as TorchBind constructor (#47819)
  • Support user defined classes as constants (#45556)
  • Allow del statements with multiple targets (#48876)
  • Tuple Slice with both negative and positive stepped size (#48660)
  • Expose run_async function on torch::jit::Method (#48607)
  • Add flag torch_jit_disable_warning_prints to allow disabling all warnings.warn (#49313)
  • Add dict comprehension (#47774)
  • Adding support for bitwise augassignment operators (+= style statements) (#44621)
  • Support the in operator with str (#47057)
  • Adding JIT support for cuda streams and events (#48020)
  • Add Type::{castRaw,expectRef} (#50061)
  • Allow arbitrary docstrings to be inside torchscript interface methods (#50271)
  • Change list striding parameters to take optional integer (#48719)
  • Add support for scripting and running module level hooks in JIT (#49544, #49975, #49545, #49546, #49547)
  • Support default argument values of a method (#48863)
  • Graceful invalidation of Python Node/Value/Block when C++ object is deleted (#50326)
  • Support Union[NoneType, T] as input type (#51605)
  • Allow implicit boolean conversion of lists, strings, and dictionaries (#51683)

Mobile

  • Add instance_key into mobile stats logging. (#45517)
  • Profiling allocator for mobile. (#43951)
  • [Metal] Add Metal/MPSCNN support on iOS (#46112)
  • [Metal] Introduce USE_PYTORCH_METAL (#46383)
  • [Metal] Support Resnet models (b63ddd6)
  • PyTorch NNAPI integration prototype (#46780)
  • [Metal] Enable Metal on macosx (#47635)
  • [Metal] Enable optimize_for_mobile on Linux (#46384)
  • [Android] Fix YUV camera image to tensor (#50871)
  • [Android] turn on USE_VULKAN for android builds by default (#51291)
  • Add windows JNI support (#44257)
  • Enable partial loading of GPU models on linux CPU machines (#51236)

Distributed

  • Support send and recv in c10d NCCL backend (#44921, #44922)
  • Add support for NCCL alltoall (#44374)
  • Upstream fairscale.nn.Pipe into PyTorch as torch.distributed.pipeline (#44090)
  • Add a --logdir option to log subprocess output to files in DDP launcher. (#33193)
  • Support RRef.backward() for local RRefs. (#46568) and Owner RRefs. (#46641)
  • Support C++ implementation for DDP communication hook. (#46566)
  • Provide 2 default C++ comm hooks for DDP (#46701)
  • Support remote device format "worker_name/device" (#46773)
  • Enable creation and transfer of ScriptModule over RPC (#48293)
  • Enable TCPStore on Windows (#47749)
  • Support torch.distributed.irecv(src=None, ...) as recv_anysource (#49383)
  • Implement layer-wise PowerSGD as a DDP comm hook (#49639)
  • Support alltoall_single in TorchScript (#48345)
  • Enable GPU-to-GPU comm in TensorPipeAgent (#44418)
  • Support timeout in rref._get_type() (#50498)
  • Support timeout for RRef proxy functions (#50499)
  • Add optimizer state sharding as ZeroRedundancyOptimizer (#46750)
  • Add distributed functional Adam optimizer (#50624), SGD optimizer (#50618), Adadelta optimizer (#50623), RMSprop optimizer (#50619), AdamW optimizer (#50620)
  • Create a DDPLoggingData struct and expose it to python interface (#50622)
  • Implement autograd functions for c10d communication operations (#40762)
  • Enable TensorPipe's SHM transport (#50760)
  • Support device map for distributed autograd while using TensorPipe. (#44859)
  • Create PyTorch DDP logging APIs for applications to use (#50637)
  • Add set_exception API in torch.futures.Future (#50983)
  • Add scatter_object_list API for c10d (#43930)
  • Provide parameter to pass GPU ID in barrier function (#49069)
  • Enable TensorPipe CUDA fallback channel (#50675)
  • Enable TensorPipe's InfiniBand transport (#50761)

torch.fx

  • allow custom behavior for args, kwargs, and bool (#45193)
  • Mutable Graph APIs (#45227)
  • Make output a non-special Node (#45599)
  • Make Tracer.trace() just return a Graph (#45704)
  • Preserve type annotations on generated code in Graph (#45880)
  • Make graph_copy examine existing values in val_map (#46104)
  • Allow tracing free functions (#46268)
  • Make sure args/kwargs are immutable (#46325)
  • Make wrapped functions traceable (#46692)
  • Added GraphModule.to_folder (#47544)
  • Support default args in symbolic tracing (#47615)
  • Add Node.all_input_nodes (#48270)
  • Support torchbind as attribute in torch.fx symbolic tracing (#48732)
  • Create subgraph rewriter API (#49540)
  • Make len traceable and scriptable with wrap (#50184)
  • Add Interpreter and Transformer APIs (#50420)
  • Add alternative prettyprinting method to Graph (#50878)
  • Move some heavily used passes out of experimental (#51392)
  • Added partial concrete values for symbolic tracing (#51609)

Quantization

  • Quantized Operators and Modules
    • Embedding and EmbeddingBag operator support
      • creating quint4x2 dtype for quantized tensors (#44678)
      • PerChannelFloatQParams support for quint4x2 dtype (#45594)
      • Add 4-bit embedding_bag prepack/unpack support using quint4x2 (#45751)
      • Support 4-bit embedding_bag operators using the dtype quint4x2 (#45752)
      • Support for 4-bit quantized EmbeddingBag module (#45865)
      • Refactor qembeddingbag to remove duplicate code (#45881)
      • Rename the sparse argument for embedding_bag ops (#46003)
      • Add support for pruned weights in embedding_bag_byte lookup (#47329)
      • fp16 -> fp32 EmbeddingBag moved into CPU impl (#47076)
      • Add non-fbgemm fallback implementation for embedding lookup ops (#50706)
      • Out variant for embedding_bag_4bit_rowwise_offsets (#51324)
      • Using int32 as indices for embedding_bag operators (#45878)
    • Add transposed conv support for fbgemm backend for 1d, 2d, 3d (#46607, #46608)
    • Add quantized flip dispatch (#46235)
    • Add support for ReflectionPad2d (#48036)
    • Dynamic GRU quantization support (#49448)
    • Quantizable LSTM (#49671)
  • Quantization Flow/API
    • quantization: Linear + BatchNorm1d fusion (#50748)
    • compare_model_stub_fx API implementation (#48951)
    • Add additional_fuser_method_mapping to config (#46355)
    • Compare Weights FX Implementation (#48056)
    • Numeric Suite: Swap with shadow modules only for quantized part of model (#51052)
  • FX Graph Mode Quantization
    • Add prepare_custom_config_dict and convert_custom_config_dict (#46223, #46364)
    • Add FixedQParamsFakeQuantize module (#46657)
    • Add support for additional_fuse_method_mapping (#46345), additional_{fusion/quant}_pattern (#46346)
    • Support in qat sigmoid/hardsigmoid/tanh (#46871), convbn{relu}1d (#47248), FloatFunctional (#46634)
    • custom_module support static/dynamic/weight_only quant (#46786)
    • Support standalone_module_class (#47705)
    • Embedding/EmbeddingBag works in static quant qconfig (#48062)
    • Add MatchAllNode in pattern matching (#48979)
    • Add support for dynamic quant for RNN and RNNCell (#49126), ConvTranspose{n}d (#49717), quantizing functional linear + {functional relu/module relu} (#50975), functional conv2d + relu (#51079), functional conv1d and conv3d (#51155) (#51254), Scalar as first input for add/mul (#46751), leaky relu (#45712), Embedding (#46677), EmbeddingBag (#46678)
    • Remove inplace option for convert_fx (#46955)
    • Support non_traceable_module/module_class (#46298)
    • Add additional_object_mapping argument to convert (#46338)
    • Keep linear op unchanged when qconfig is not supported (#48067)
    • Move {input|output}_quantized_idxs cfg from convert to prepare (#49238)
    • Allow user to specify qconfig for call_method (#49621)
    • Do not observe bias on F.conv and F.linear (#49623, #49628)
    • Linear work with float_qparam_dynamic_qconfig (#47068)
    • Fix error that DefaultQuantizer is not inserted after a module configured with None qconfig (#47316)
    • Scope support for call_method in QuantizationTracer (#50173)
    • Support preserved_attributes in prepare_fx (#50306)
    • Add option to leave graph inputs and/or outputs quantized (#48624)
    • Support quantization for custom module (#44074)
    • Remove inplace option for fuse_fx (#46953) and prepare_fx (#46954)
    • Scope support for call_function in QuantizationTracer (#51086)

ONNX

  • Preprocess index_put with bool inputs to torch.masked_{scatter,fill} (#45584)
  • Export torch.{var,var_mean,std_mean} ops (#45678)
  • Enable NoneType inputs to export API (#45792)
  • Add export of prim::dtype, prim::tolist (#46019)
  • Enable onnx shape inference in export by default (#46629)
  • Add torch.silu operator support for onnx (#51519)
  • Support list remove for onnx export (#51526)
  • Added torch.hardswish symbolic in opset 9 (#48423)
  • Add export of aten::is_floating point (#46442)
  • Add torch.logical_{and,or,xor} torch op support in pytorch exporter (#50909)
  • Add torch.binary_cross_entropy_with_logits op to ONNX opset version 12 (#50908)
  • Support opset13 nn.Squeeze and nn.Unsqueeze (#50906)
  • Add export of prim::data (#45747)
  • Support torch.nonzero(*, as_tuple=True) export (#47421)
  • Update Reducesum operator for opset 13 (#50907)

Misc

  • Enable python code coverage on windows (#44548) and onnx (#47387)
  • Fix PyTorch compilation on Apple M1 chips (#48275, #49701)

Improvements

Python API

  • Add integer support (by promoting integer to float) to torch.{cos,sin,tan} (#45733, #46706), torch.log{2,10} (#46810), torch.{a}tanh (#47064), torch.a{cos, tan} (#47005), torch.a{cosh, sinh} (#47152), torch.sqrt (#47293), torch.log1p (#48002), torch.erf{c} (#48472), torch.asin (#48461), torch.sigmoid (#47551), torch.sinh (#48644), torch.cosh (#48923), torch.exp{2, m1} (#48926), torch.reciprocal (#49102), torch.erfinv (#49155), torch.rsqrt (#47909), torch.exp (#50093), torch.lgamma (#50140)
  • Add optional dtype argument to Tensor.view (#47951)
  • Add out optional arguments to torch.{reshape,flatten} (#51249), torch.tensordot (#47278), torch.fft.* (#49335), torch.narrow_copy (#49502)
  • Add support for int32 indices and offset in nn.Embedding and nn.EmbeddingBag (#46758)
  • Add boolean type support to torch.where (#47454), torch.mul and Tensor.__mul__ (#48637), torch.diag (#47455), torch.{all,any} (#44790), Tensor.to_dense (#50019)
  • Add inplace version of torch.cum{sum,prod}_ (#47651)
  • Add sparse support to torch.sqrt (#50088)
  • Add support for both dtype and ord arguments in torch.linalg.norm (#46637)
  • Make torch.nn Module accept batch size of 0: nn.ReplicationPad (#39137), nn.Unfold (#40689), nn.PixelShuffle (#49187), nn.AvgPool{1,2,3}d (#50008), nn.MultiLabelMarginLoss and nn.MultiMarginLoss (#50007)
  • utils.cpp_extensions Ensure default extra_compile_args are properly handled (#45956)
  • torch.LongTensor legacy construction improved error message (#46147)
  • torch.utils.checkpoint allow having Tensors that don’t require gradients (#45934)
  • torch.nan_to_num: fix deprecated warnings (#46309)
  • Remove more use of “blacklist” (#45512, #45781)
  • Add type annotation to submodules: torch.nn.cpp (#46490), torch.nn.parallel.comm (#46736), torch.nn.modules.* (#46828, #45772, #46013, #49957, #49479, #49045, #49035, #49494, #48969), autograd functions from c++ (#46622), torch.distributed functions from c++ (#46623), torch.storage (#46876), torch._tensor_str (#48463, #48584), torch.nn.modules.pooling (#48412), common_nn (#48190), torch.lobpcg (#47680), torch.nn.functional (#50106), torch.overrides (#50824), torch.generate_torch_version (#51637), torch.distributions (#45689), torch.quantization.quantize_jit (#45548), torch.utils.tensorboard (#49834), torch.multiprocessing (#47756), torch.cuda (#47134), torch._C._distributed_rpc (#46624), torch.distributed.* (#47531, #47532, #47533, #47534), torch.nn.parallel._functions (#49687)
  • Make comparison fail when dtypes don’t match (#47288)
  • Allow large inputs for torch.svd (#47440)
  • Add nondeterministic alerts to torch.index_copy, torch.median on CUDA and torch.kthvalue on CUDA (#46942)
  • Add float16 and bfloat16 support to torch.where (#49004), torch.matmul (#47873)
  • Add float16 support for CPU and bfloat16 support for CPU & CUDA to torch.flip and torch.flip{lr, ud} (#49895)
  • Add support for providing indices as a Tensor for torch.tensor_split (#49169)
  • Add support for SELU activation in torch.nn.init.calculate_gain (#50664)
  • Add function version of torch.optim optimizers and refactor existing classes to use the functional version: SGD (#45597), Adadelta (#50409), RMSProp (#50410), AdamW (#50411)
  • Improve error message when window is on wrong device for torch.fft.stft (#51128)
  • Add rounding_mode selection to torch.div (#51706, #52242)
  • Remove spurious numpy writable warning (#47271)
  • Enable deterministic mode for rocBLAS (#48654)
  • Hipify submodule revamp and improved integration with cpp_extensions (#48715)
  • Remove warning about saving state in torch.optim.lr_scheduler.LambdaLR (#46813)
  • Improve typing of torch.nn.Unflatten (#49838)
  • Add exception classification to torch.multiprocessing.spawn

Autograd

  • Add double backward checks for the torch.fft submodule (#46004)
  • Detect inplace modifications of views of leaf Tensors earlier to improve error (#46204)

torch.utils

  • data.TensorDataset: Add more specific error message (#46905)
  • data.DistributedSampler: Additional validation (#48865)

Complex Numbers

  • Improve error message thrown by torch.sign for complex tensors (#43280)
  • Remove unnecessary dtype checks for complex types and disable complex dispatch for CPU torch.{min,max} pointwise ops (#50465)

CUDA

  • Allow consumer ops to sync on autograd engine base gradient (#45787)
  • Add torch::cuda::nccl::{send,recv} (#45926)
  • Cusolver inverse check info (#46625)
  • Make numpy optional dependency for torch.cuda.amp (#48154)
  • Support all visible cards when building a cuda extension (#48891)
  • Enable using torch.utils.checkpoint.checkpoint and torch.cuda.amp at the same time (#49757)
  • Make DeviceCachingAllocator's error handling more defensive and a bit easier to read (#51158)

Distributed

  • Create NCCL communicator for send/recv on demand (#44922)
  • Reduce the peak memory of fp16 compression DDP comm hook by avoiding converting to fp32 (#46078)
  • Allow RPC framework to use rank in addition to WorkerInfo and name. (#46221)
  • Add to the HashStore getNumKeys() (#46048) and deleteKey() (#46049)
  • Print exception message on both RPC caller and callee (#46372)
  • Add RRef proxy support for ScriptModule methods (#48339)
  • Support retrieving the RRef to the remote module (#48983)
  • Add a c++ interface in processGroup to get its backend name (#51066)
  • Enable NamedTuple data type to work with DDP (#44220)
  • Support send/recv to/from self when communicator is created on demand (#45873)
  • Add Error log when ProcessGroupNCCL takes down a process (#44988)
  • Provide additional information about NCCL error codes. (#45950)
  • Avoid scatter for single-device case in DDP (#46304)
  • Use Blocking Wait if both Blocking Wait and Async Error Handling Are Set (#47926)
  • Providing more information while crashing a process in async error handling (#47246)
  • Add PowerSGD comm hook (#48060)
  • Define a customized state for PowerSGD comm hook (#48348)
  • Add a random generator to PowerSGD state for initializing low-rank matrix Q (#48507)
  • Replace the key of error_dict in PowerSGD state with bucket index (#48867)
  • Make CUDAFuture remember and restore current device in callback (#48789)
  • Update pipeline API to accept arbitrary sequence of Tensors and not just Tuple (#48467)
  • Use group.WORLD appropriately in process group initialization. (#48767)
  • Add error feedback to layerwise PowerSGD (#49418)
  • Warm-start of PowerSGD by reusing states from previous iteration is possible (#49451)
  • Change wait() to value() in some callbacks of PowerSGD communication hook (#49709)
  • Ensure DDP + Pipe works with find_unused_parameters. (#49908)
  • Enable TensorPipe CUDA sending to self (#50674) and GDR channel (#50763)
  • Add warning to distributed optimizer (#50630)
  • Make python object collective API args consistent (#50625)
  • Add option to make rref.get_type non-blocking. (#50977)
  • Unescape string in RPC error message (#49373)
  • Event Logging for NCCL Async Error Handling Process Crash (#47244)
  • Remove balance and devices parameter from Pipe. (#48432)
  • Error feedback for PowerSGD DDP comm hook (#48670)
  • Add an index field to GradBucket for PowerSGD (#48757)
  • Have FutureNCCL record streams w/ allocator in addCallback (#48496) and events in current stream (#48497)
  • Use fresh stream from pool for each FutureNCCL callback (#48498)
  • Record CUDA events for "follow-up" FutureNCCL inside markCompleted() (#48499)
  • Fix FutureNCCL's completed() disagreeing with wait() (#48503)
  • Fix FutureNCCL not recording DataPtrs with caching alloc in wait() (#48563)
  • Add multi-GPU support to FutureNCCL (#48500)
  • Don't store device indices separately on FutureNCCL (#48501)
  • Support wider range of types in FutureNCCL (#48502)
  • Split FutureNCCL's CUDA-specific parts from generic future logic (#48504)
  • Merge common parts of FutureNCCL into at::ivalue::Future (#48505)
  • Split out reusable CUDAFuture from FutureNCCL (#48506)
  • Cache the DataPtrs in CUDAFuture (#48788)
  • Modify Pipe to return an RRef. (#47829)
  • Cleanup APIs for pipeline parallelism. (#48630)
  • Fix TCPStore type coercion (#49685)
  • Simplify the implementation of error feedback and warm-start (#50981)
  • Explicitly specify the dtype of the error tensor (#50985)
  • Check start_PowerSGD_iter > 1 and add guidance on tuning PowerSGD configs. (#51427)
  • Check if the backend is NCCL when a DDP communication hook is registered (#51759)

TorchScript

  • Add multiline string dedent support (#45580)
  • Add string versions of argument funcs in jit Node (#45464)
  • Make sure each warnings.warn only executes once inside TorchScript. (#45382)
  • Allow slicing multiple dimensions with indexes if not Tuple (#45239)
  • Change type inferred from empty annotation (#45360)
  • Fix stride printing/parsing formatting (#45156)
  • Make objects throw Python AttributeError on nonexistent attr access (#45911)
  • Make InsertInstruction overflow check a warning instead of fatal (#46369)
  • Add an option to getWriteableTensorData to avoid copy CUDA tensor to CPU (#46524)
  • Add error messages and workaround for RET failure of containers with a torch class type (#46543)
  • Correctly mark unannotated NamedTuple field to be inferred TensorType (#46969)
  • Enable ModuleDict non-literal indexing (#45716)
  • Add an attribute to the torchscript model exported by metal (#47174)
  • Print out interface mismatch for prim::ModuleDictIndex (#47300)
  • better message for bad type annotation (#47464)
  • Resolve string literal type annotations using Resolver::resolveType (#47731)
  • Resolve torch.device in recursive compilation of classes (#47734)
  • Metacompile boolean constants (#46721)
  • Allow JIT unpickler to accept CUDA DataPtr from read_record_ (#46827)
  • Skip None submodule during JIT-tracing (#49765)
  • Add __prepare_scriptable__ duck typing to allow replacing nn.Modules with scriptable preparations (#45645) (#49242)
  • Fix deprecation warning in scalar_type_analysis (#50218)
  • Support scripting classmethod called with object instances (#49967)
  • Use FileStore in TorchScript for store registry (#50248)
  • Treat has_torch_function and object_has_torch_function as static False when scripting (#48966)
  • Print better error when class attribute IValue conversion fails (#50255)
  • Clean up some type annotations in test/jit/...../test_class_type.py (#50156)
  • Type annotations in test/jit (#50293)
  • Eliminate static default_extra_files_mobile from header import.h (#50832)
  • Dump torch::jit::AliasDb objects as Graphviz files (#50452)
  • Fix test_jit_cuda_archflags on machine with more than one arch (#50405)
  • Provide more info when attribute fails to convert (#50870)
  • Adding correct error message for for..else (#51258)
  • Handle error during dict expansion (#51374)

Mobile

  • Update default output extension in optimize_for_mobile.cc (#45598)
  • Add named tuple's error message and workaround for RET failure (#46347)
  • [Metal] Add metal backend type (#46455)
  • [Metal] Add the Python binding for optimize_for_mobile (#46456)
  • [Metal] Add pin_memory check in empty_strided (#47228)
  • [Metal] Calculate strides for metal tensors (#50309)
  • [Metal] Clean up the operator tests (#50311)
  • Add an overload for deserialize() that doesn't accept the extra_files map. (#50932)
  • bundled_inputs: Preserve bundled input related methods when calling optimize_for_mobile (#49170)
  • bundled_inputs: Preserved all functions generated by bundled inputs (#51496)
  • bundled_inputs: Expanded Bundled Inputs To Any Public Function (#51153)
  • Expose _export_operator_list to python (#51312)

Quantization

  • Quantized Operators and Modules
    • Add reflection padding to conv (#49011)
    • Add support for 2D indices for quantized embedding operators (#47766)
    • quantize_tensor_per_channel ARM implementation (#46018)
    • Support either min or max in qclamp (#45937)
    • Add preliminary support for advanced indexing (#49346)
    • Add backend_independent option for quantized linear module (#48192)
    • Add out-variant for the reflection pad (#48037)
    • Support 2 dim input in quantized batchnorm 1d (#51597)
  • Typing, Formatting, Error Messages, Logging and Tests
    • numeric suite: add types to eager (#51168)
    • Enable type check for torch.quantization.fake_quantize (#45701)
    • Type check for torch.quantization.observer (#45630), torch.quantization._numeric_suite (#46330), torch.quantization.stubs (#46475), quantization.fx.Quantizer (#48343), quantization.fx.Quantizer (#48350), quantization_mappings.py (#49179), fusion_patterns.py (#49606), torch/nn/quantized/modules (#49941), quantization-related files in torch/jit (#49939), fuser (#48844), quantization_patterns (#48851), observed_module.py (#49607), quantization (#49942)
    • Enable mypy on torch/quantization/fx/* (#48331)
    • Make each line of fx/quantize.py <=80 chars (#48357)
    • Add more typehints (#48774, #48794, #48792)
    • Nice error message on convtranspose with per-channel weight (#49899)
    • Throw a nice error message for allclose with quantized inputs (#49802)
    • Add type annotations to torch.nn.quantized.modules.conv (#49702)
    • Add type annotations to conv_fused/blas_compare/blas_compare_setup (#51235)
    • Add API usage logging to numeric suite (#46504) and quantization (#46095)
  • Sparsity
    • Block Sparse kernel (#50585)
    • Add A matrix pretransformed based sparse kernels for linear (#50587)
    • Add dynamic linear sparse kernel for arm64 (#50591)
  • Others
    • Use tensor's quantized properties directly in pickler (#46267)
    • Remove register api and rename get_mapping to get_default_mapping (#46337)
    • Update HistogramObserver to be scriptable (#51081)
    • Support varying size input in numeric suite (#47391)
    • Backend string for the quantized types (#49965)
    • Disable pruning on embedding look up operators when compressed_indices_mapping = {0} (#48672)
    • Support out variant of embedding_bag_byte_rowwise_offsets_out (#49561)

ONNX

  • Update embedding_bag export (#44693)
  • Improve error handling for adaptive_pool (#45874)
  • Support nd mask index in opset >= 11 (#45252)
  • Update peephole pass for prim::ListUnpack (#46264)
  • Slightly improve indexing with ellipsis under scripting (#46571)
  • Update batch_norm symbolic to handle track_running_stats=False (#47135)
  • Cast Gather index to Long if needed (#47653)
  • Handle dynamic input axes for prim_ConstantChunk (#48176)
  • Remove usage of isCompleteTensor() in symbolic functions (#48162)
  • Changes to export API to better handle named arguments (#47367)
  • Modified var_mean symbolic to support more combinations of dims (#48949)
  • Support gelu for fp16 export (#50911)
  • Enable Constant Folding for ONNX Opset 13 (#51523)
  • Export and shape inference for prim uninitialized in If subblock (#46094)
  • Scripting support for inputs to index_put (#46866)
  • Track and list model params for scripting (#47348)
  • Modifications in remove inplace ops passes to better handle binary inplace ops (#51572)
  • Improve error message for parse_arg in symbolic functions (#51516)
  • Update error message that displays when encountering an op unsupported for ONNX export (#51522)
  • Preserve param names during in-place op removal (#50955)
  • Handle sequence output shape and type inference (#50599)
  • Update constant-folding of Gather op to include cases where rank of indices input is 0 (#51514)
  • Update unsafe_chunk() method to support new version 13 of Split operator (#51524)
  • Replace optional parameters of Resize with placeholder for ops13 (#50954)

Vulkan

This release brings about a complete rewrite of PyTorch’s Vulkan backend with primary focus on improved performance, robustness, and better code structure and organization. These changes are transparent to the end user. Considering that this is a rewrite, many of these changes also qualify as performance improvements.

Misc

  • Factory operators (at::empty, at::zeros,...) now have a new overload in the C++ API that takes ScalarType, Layout, Device and pin_memory parameters separately, in addition to the previously existing overload that takes one TensorOptions argument. (#44087)

Bug fixes

Python API

  • Fix torch.nn.BatchNorm{1,2,3}d channels_last contiguity check (#50659)
  • Fix torch.nn.ConstantPadNd not preserving memory format (#50898)
  • Fix dtype of first sample in torch.quasirandom.SobolEngine (#51578)
  • Fixes bug in torch.sspaddmm (#45963)
  • Check support_as_strided before using torch.empty_strided (#46746)
  • Fix internal assert for torch.heaviside with cuda tensor and cpu scalar tensor (#46831)
  • Fix negative column numbers for torch.eye (#46841)
  • Fix segfault with torch.orgqr (#46700)
  • Fix torch.nn.functional.embedding padding_idx behavior (#46714)
  • Fix torch.nn.Embedding.from_pretrained to properly handle the padding_idx argument (#47184)
  • Fix functions not handling discontiguous Tensors properly: torch.dropout (#47552), torch.median (#46917)
  • Fix max_pool2d with ceil_mode (#46558)
  • Fix type promotion for torch.trace on CPU (#47305)
  • Fix torch.kthvalue error for scalar input (#47600)
  • Fix multinomial when input has 0 probability (#47386)
  • Fix incorrect warnings in torch.nn.Parameter{List,Dict} (#48315)
  • Fix printing of torch.device (#48655)
  • Fix parameter generator exhaustion in torch.optim.SparseAdam (#47724)
  • Fix torch.pow bug for complex exponents (#49809)
  • Fix gradient for torch.norm when p=+inf (#48611)
  • Fix SyncBatchNorm when stats tracking is disabled (#50126)
  • Fix torch.elu backward when alpha is negative (#49272)
  • Fix pickling for Tensor-like objects (#47732)
  • Fix torch.distributions.Half{Cauchy,Normal} support for validate_args=True (#50403, #50492)
  • Fix torch.distributions.CatTransform for event_dim > 0 (#49111)
  • Fix torch.distributions.Binomial to retain lazy logit initialization (#46055)
  • Fix torch.pow when exponent is provided as a scalar Tensor and on different device (#46185, #46320)
  • Fix classmethod override argument passing for Tensor-like objects (#47114)
  • Fix internal assert when inputs are on the wrong device for torch.{maximum, minimum} (#48446)
  • Fix torch.distributions.utils.broadcast_all crashing on Tensor-like objects (#48169)
  • Fix vectorized conversion of -nan from float16 to float32 (#41280)
  • Fix torch.silu backward for all backends other than CPU and CUDA (#49439)
  • Fix wrong output when torch.kthvalue out= argument overlaps with input (#48254)
  • Fix advanced indexing for Tensor-like objects (#49324)
  • Fix torch.distributions.TransformedDistribution shape logic(#50581)
  • Fix torch.nn.functional.interpolate backward on GPU for nearest interpolation (#51240)
  • Fix torch.svd ignoring some keyword argument for empty inputs (#51109)
  • Fix torch.distributions.Dirichlet arg_constraints (#51369)
  • Use deterministic implementation of torch.index_put and torch.index backward CPU in deterministic mode (#51388)
  • Removes spurious warning in torch.nonzero (#51618)
  • Fix calculation of number of elements to not overflow in many c++ implementations (#46997)
  • Fix Parameter detection as Tensor in c++ backend (#48963)
  • Fix bug in miopen findAlgorithm (#46852)

Autograd

  • Fix deadlock on Windows due to bad thread termination in autograd engine (#43532)
  • Fix deadlock in tsan builds due to bad locking in the engine (#45867)
  • Avoid NaN values in torch.cdist backward for p<1 (#45720)
  • Fix handling of requires_grad arg for torch.new_{full,empty,zeros} (#46486)
  • Fix inplace check logic to be triggered when written-to Tensor does not require gradients (#46296)
  • Set proper output differentiability for torch.unique (#47930), torch.count_nonzero (#50866)
  • Fix race in autograd engine that can lead to std::out_of_range error (#50164, #50372)
  • Fix autograd thread crash on destruction with python-3.9 (#50998)
  • Fix autograd side effects when printing (#51364)
  • Fix memory leak in anomaly mode (#51610)
  • fix torch.hardsigmoid backward at boundary values (#51454)

CUDA

  • Fix incorrect CUDA torch.nn.Embedding result when max_norm is not None and indices are not sorted (#45248)
  • Ensure kernel launches are checked (#46474, #46727)
  • Fix bit math (#46837)
  • Fix test_inverse_singular for cublas path; fix cusolver inverse multi-stream issue (#47026)
  • Fix indices computation for trilinear interpolate backwards (#50084)
  • Fix for possible RNG offset calculation bug in cuda vectorized dropout with VEC=2 (#50110)
  • Disable cuDNN persistent RNN on sm_86 devices (#49534)
  • Fix Error with torch.flip for cuda tensors when dims=() (#50325)
  • Fix replication_pad CUDA launch configuration (#50565)
  • Workaround for MAGMA accessing illegal memory in batched cholesky (#50957)
  • Fix torch.cdist backward CUDA error due to illegal gridDim setting (#51569)
  • Prevent CUDAFuture from using uninitialized device index (#51505)
  • Fix incorrect usage of CUDACachingAllocator (#48817)
  • Fix torch.cuda.memory_allocated to return {} if not initialized (#51179)
  • Fix crash when trying to reset memory stats when no cuda device is available (#48406)

torch.utils

  • data.DistributedSampler: Fix possible padding length overflow (#45329)
  • data.DataLoader: Fix hang with large sampler (#48669)
  • data.DataLoader: Fix unintended error when worker force kill happens #43455 (#43462)
  • data.DataLoader: Fix persistent_workers + pin_memory (#48543)

Complex Number

  • Make torch.view_as_real raise a proper error for backends where it is not supported (#47018)
  • Fix bug in toComplexWithDefault (#43841)
  • Fix torch.cat backward formula to return correct gradient values for R -> C case (#51681)
  • Update backward formulas for torch.{add, sub} to correctly handle R -> C case. (#46596)
  • Add custom implementation for torch.csqrt if libc++ is used (#52018)

C++ API

  • Refine ConvParams::use_nnpack() to allow the NNPACK convolution algorithm to be used only for kernels up to 16x16. (#49464)

Distributed

  • Record FutureNCCL callback stream on CUDA caching allocator (#45318)
  • Fix object-based collectives API to use torch.cuda.current_device instead of rank (#46897)
  • Explicitly restrict the scope of torch.cuda.synchronize to the current device in PowerSGD (#49711)
  • Fix Hang in Async Error Handling due to Work logging (#46265)
  • Add missing recordStream in ProcessGroupNCCL::alltoall_base (#46603)
  • Allow DataParallel to run zero input Module (#46565)
  • Fix DDP issue where parameters share same grad_accumulator (#46755)
  • Fix ProcessGroupNCCL profiling when profiler is not run with use_cuda (#48946)
  • Refactor RPC matchBuiltInOp to get rid of exception swallowing (#49009)
  • Solve zombie process problem in DDP launcher (#49305)
  • Fix memory leak in TensorPipeAgent. (#50564)
  • Fix warm-start for PowerSGD layer-wise compression (#50283)
  • Fix CUDA RPC Stream Synchronization (#50949)
  • Fix benchmarks/distributed/ddp/benchmark.py (#51095)
  • Fix store based barrier to only use add (#49930)

Mobile

  • Fix out-of-bounds access for caching allocator calls (#46439)
  • Fix CPUCaching allocator guard bug (#46922)
  • [Metal] Make the dst tensor contiguous when copying from metal (25833e5)
  • [Metal] Fix the broken strides value for 2d transpose (#50310)
  • [Android] Fix yuv conversion (#50951)

TorchScript

  • Fix bugs in a number of ops in CUDA fuser (#47795, #49143, #49396 ,#48329 and others)
  • Fix dict update (#45857)
  • Fix Dict bug in constant hashing (#45929)
  • Fix TypeError when torch.jit.load is passed a pathlib.Path (#45825) (see the sketch after this list)
  • Fix missing call to __setstate__ when cloning modules (#45858)
  • Prevent caching of graph attribute. (#46960)
  • Fix traced training attribute (#47211)
  • Correctly compare Stream IValues (#47303)
  • Correctly print out sign of near-zero double values (#47081)
  • Properly serialize types that only appear at function input (#47775)
  • Fix bug in get_annotation_str for ast.Subscript (#48741)
  • Fix include files for out-of-tree compilation (#48827)
  • Fix constant propagation schemas (#49605)
  • Fix return type Any for Ternary ops (#49165)
  • Fix for module_has_exports (#50680)
  • Properly convert Python strings implicitly to device (#51340)
  • Add missing support for torch.jit.Final in python 3.6 (#47393)
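
A quick REPL sketch of the pathlib fix above; the module and filename are arbitrary.

>>> import torch
>>> from pathlib import Path
>>> m = torch.jit.script(torch.nn.Linear(2, 2))
>>> torch.jit.save(m, "linear.pt")
>>> loaded = torch.jit.load(Path("linear.pt"))  # a pathlib.Path now works without raising TypeError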

torch.fx

  • Fix recursion depth issue on Graph deepcopy (#46669)
  • Fix handling of inf and nan literals (#46894)
  • Fix corner case in name sanitization (#46958)
  • Fix submodule naming for subgraph split (#47869)
  • Fix create_arg for NamedTuple (#48986)
  • Fix python code having spurious newlines from placeholders (#49720)
  • Make split_module results deterministic (#50470)
  • Fix tracing a free function with embedded constant (#50639)
  • Fix using fx.wrap as a decorator (#50677) (see the sketch after this list)
  • Fix annotation in generated code (#50777, #52021)
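
For the fx.wrap fix above, a rough sketch of the decorator form, which keeps a free function as a single call_function node instead of tracing into its body; the names here are illustrative.

import torch
import torch.fx

@torch.fx.wrap  # usable directly as a decorator
def clamp_sum(x):
    return x.clamp(min=0).sum()

class M(torch.nn.Module):
    def forward(self, x):
        return clamp_sum(x) + 1

traced = torch.fx.symbolic_trace(M())
print(traced.graph)  # clamp_sum shows up as one call_function node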

Quantization

  • Remove fake_quant after add/mul nodes during eager mode QAT (#49213)
  • Add a path in torch.mean for unsupported QNNPACK modes (#45533)
  • Set type for GetAttr nodes in remapTypes (#46250)
  • Avoid inserting fakequant for sigmoid/hardsigmoid/tanh in eval mode (#47297)
  • Ensure observer respects device affinity (#47514)
  • Fix quant type classification for float_qparam qconfig (#48069)
  • Fix quant_type classification for fp16, fp16 (#48073)
  • Fix a bug in leakyReLU (#48265)
  • Fix quantization for qat.ConvBnReLU1d (#48059)
  • Add bias once in conv_fused (#48593)
  • Do not return an uninitialized qscheme from getQSchemeAndQParamVector (#49391)
  • Fix quantization for DeQuantStub (#49428) (see the sketch after this list)
  • Ensure observers do not crash for empty Tensors (#49800)
  • fake_quant: fix device affinity and buffer resizing for state_dict (#50868)
  • Fix memory leak in qnnpack ops (#51612)
  • Remove set_quantizer_ from native_functions.yaml (#49463)
  • Make choose_qparams_optimized return Tensors to preserve dtype (#45530)
  • Use PlaceholderObserver as default dynamic quant observer (#45343)
  • FixedQParamsFakeQuantize: adjust default quant_min and quant_max (#47423)
  • Add bias once in conv_fused (#48593) (#48661)
  • Fix unused var warning when building for different archs. (#48730)
  • Make the CUDA fake quantize logic consistent with CPU fake quantize logic (#49808)
  • Eager quant: fix error when removing forward hooks (#49813)
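
Several of the eager-mode items above (QuantStub/DeQuantStub handling, observer device affinity, fused conv bias) arise in the standard static-quantization flow; a minimal sketch follows, assuming the fbgemm backend is available. The module and input shapes are placeholders.

import torch
import torch.quantization as tq

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()      # fp32 -> quantized boundary
        self.conv = torch.nn.Conv2d(3, 3, 1)
        self.dequant = tq.DeQuantStub()  # quantized -> fp32 boundary

    def forward(self, x):
        return self.dequant(self.conv(self.quant(x)))

m = M().eval()
m.qconfig = tq.get_default_qconfig("fbgemm")
tq.prepare(m, inplace=True)       # insert observers
m(torch.randn(1, 3, 8, 8))        # calibration pass
tq.convert(m, inplace=True)       # swap in quantized modules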

ONNX

  • Fix torch.flatten operator (#45632)
  • Reimplement _var_mean to ensure non-negative (#47240)
  • Fix scripting of torch.{rand,randn,where} (#45793)
  • Fix torch.eye export (#47016)
  • Fix dtype for log_softmax export (#46627)
  • Fix graph position to insert clone node for inplace op removal (#51520)
  • Fix graph sequence output from loop node (#51521)
  • Do not dereference nullptr in scalar type analysis (#50237)
  • Fix bug in torch.unfold symbolic (#51515)
  • Fix opset 11 ConstantChunk with negative dim (#51525)
  • Fix bug in scatter_add (#51527)

Vulkan

  • Fix interval midpoint calculation (#46839)
  • Fix Vulkan torch.empty (and family) breakage as a result of API update. (#47937)
  • Fix Addmm prepacking to persist after GPU flush (#48313)
  • Properly forbid dilation > 1 for conv2d (#48800)

Misc

  • Fix c++ extension ninja CUDA build (#49344)
  • Only include dataclasses for py < 3.8 to make setup.py compatible with older python versions (#45611)

Performance

Python API

  • Rewrite torch.kron to improve performance and support more dtypes (#50927) (see the sketch after this list)
  • Enable the faster combined weight branch in MHA when query/key/value are the same object, even when they contain NaN (#48126)
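
As a quick reference for the rewritten operator, torch.kron computes the Kronecker product; a small REPL example with inputs chosen for readability:

>>> import torch
>>> a = torch.tensor([[1, 2], [3, 4]])
>>> b = torch.eye(2, dtype=torch.long)
>>> torch.kron(a, b)
tensor([[1, 0, 2, 0],
        [0, 1, 0, 2],
        [3, 0, 4, 0],
        [0, 3, 0, 4]])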

Autograd

  • Update autograd.gradcheck to reduce computations (#45757) (see the sketch after this list)
  • Reduce memory usage for torch.mm when only one input requires gradient (#45777)
  • Reduce autograd engine startup cost (#47592)
  • Make the torch.svd backward formula more memory- and compute-efficient (#50109)
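
For reference, gradcheck is typically invoked as below; double-precision inputs are used so numerical and analytical gradients can be compared reliably.

>>> import torch
>>> x = torch.randn(3, dtype=torch.double, requires_grad=True)
>>> torch.autograd.gradcheck(torch.sin, (x,))
True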

CUDA

  • Fix performance issue of GroupNorm on CUDA when the feature map is small (#46170)
  • Concat fast path with empty tensor (#46805)
  • Support the strided tensor on input for torch.cat (#46859)
  • Pin destination memory for cuda_tensor.to("cpu", non_blocking=True) (#46878) (see the sketch after this list)
  • Set the proper maximum number of threads per block for sm_86 to 1536 (#45889)
  • Use MTA for amp grad unscaling, enforce op math type in MTA functors, and allow op lambdas (#44778)
  • Improve performance of CUDA trilinear interpolate backward (#52649)
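
A short sketch of the device-to-host pattern the pinned-destination change speeds up; it assumes a CUDA device is available, and the synchronize call is needed before the CPU tensor is read.

>>> import torch
>>> gpu_t = torch.randn(1024, 1024, device="cuda")
>>> cpu_t = gpu_t.to("cpu", non_blocking=True)  # copy can overlap with other GPU work
>>> torch.cuda.synchronize()                    # wait for the copy to finish before using cpu_t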

C++ API

  • Avoid computing AutogradKey if not needed to speed up low level C++ calls (#46252)
  • VariableKernel calls into scattered C++ api (#44158)
  • Make validate debug-only in Device constructor (#49123)
  • Add macro to optionally devirtualize TensorImpl::numel() (#49766) and TensorImpl::sizes() (#50176)
  • Inline access to low level Dispatcher (#50644)

Distributed

  • Only track variables with grad accumulator for find_unused_parameters=True in DDP to save memory (#45942)
  • Benchmark combining Distributed Data Parallel and Distributed RPC (#46993)
  • Drop FutureNCCL in favor of vanilla CUDAFuture (#49014)
  • Pytorch Distributed RPC Reinforcement Learning Benchmark (Throughput and Latency) (#46901)

TorchScript

  • Optimized hot path in JIT graph executor (#47465, #48061, #48034)
  • Added support for is_nan, to, and lgamma in CUDA fuser (#45791, #48973, #48976)
  • Added additional optimizations as part of torch.jit.freeze (Conv-Batchnorm, Conv-Add, and Conv-Mul folding, Dropout Removal) (#50222) (see the sketch after this list)
  • Fast TypeMeta/ScalarType conversion (#45544)
  • Fix getCustomClassType() perf (#48981)
  • Avoid move-constructing a List in listConstruct (#49355)
  • Specialize list_element_from for IValue to avoid extra move/copy (#50124)
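
A rough sketch of invoking torch.jit.freeze, which now also applies the foldings listed above; the exact set of optimizations can vary, and the model here is a placeholder.

import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3),
    torch.nn.BatchNorm2d(8),
    torch.nn.ReLU(),
).eval()                                            # freezing requires eval mode

frozen = torch.jit.freeze(torch.jit.script(model))
print(frozen.graph)                                 # Conv-BatchNorm should be folded into a single conv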

Mobile

  • Avoid inlining kernel lambdas on mobile (#46249)
  • Free original weight after prepacking in XNNPACK based op (#46541)
  • [Metal] Make permuteWeights inline (#47634)
  • [Metal] Use MPSCNN kernels for binary elementwise ops (c18403a)

Vulkan

  • Enable prepacked addmm/mm for linear layers (#47815)
  • Tweak memory use. (#47728)
  • Add linear memory allocator. (#48569)
  • Optimize Vulkan command buffer submission rate. (#49112)

torch.fx

  • Speed up non-parameter tensor lookup (#47325)

Quantization

  • Parallelize the quantization conversion operators (#45536)
  • Add a more memory efficient version of fake quant (#50561)
  • Memory-efficient learnable fake quantization (#49315, #51255, #51159)
  • Remove contiguous calls in qembeddingbag (#48993)
  • Update embedding module to not store qweight (#50418)

Misc

  • Extra sampling of record function events for the profiler (#49114)

Documentation

Python API

  • Add information on how to control randomness in DataLoader (#45749)
  • Revamp reproducibility notes (#45748)
  • Revamp torch.optim doc for better understanding (#45944)
  • Revamp torch.sparse tensor documentation. (#45400)
  • Add doc for torch.overrides submodule. (#48170)
  • Add note on nn.Module overview and design principles (#51536)
  • Add helper functions section to torch.fft doc (#46032)
  • Add object-based collective APIs to public docs (#48909)
  • Fix diverse typos and rendering issues in the torch docs (#46328, #46589, #47545, #48316, #48328, #48673, #48787, #47762, #48970, #49136, #49388, #49413, #49584, #49667, #41887, #50254, #51053, #51212, #51439, #51286, #49648)
  • Fix diverse typos and rendering issues in the torch.nn docs (#45662, #45660, #45587, #45763, #46853, #48577, #48775, #49950, #50430, #48596)
  • Fix diverse typos and rendering issues in the torch.linalg docs (#51459, #51353, #51620, #51641, #51651, #51658, #51659, #51660)
  • Update docs for torch.nn: in-place modification of weight in nn.Embedding (#45595)
  • Update docs for torch.distributions: NegativeBinomial (#45693), Categorical (#45804), LKJCholesky (#52904)
  • Improve torch.matmul doc regarding broadcasting (#45699)
  • Add function signature for torch.pixel_shuffle (#45661)
  • Fix signature for torch.poisson (#45656)
  • Add 3D reduction example to torch.tensordot (#45697)
  • Fix torch.matrix_exp (#45909)
  • Fix typo in torch.load docstring for the f parameter (#49350)
  • Document fix for torch.logspace and torch.linspace (#46056)
  • Improve clarity of torch.norm (#42696)
  • Fix info on the shape of pivots in torch.lu (#46844)
  • Add generator param in torch.randperm doc (#47231)
  • Updated doc for torch.{v}dot (#47242)
  • Update doc of torch.eig about backward (#47598)
  • Fix torch.swap{dim/axes} to properly appear in doc (#48376)
  • Add global nn.Module hooks to nn doc (#48374)
  • Added torch.linalg.cond to the docs (#48941)
  • Improve new_group example in the context of torch.nn.SyncBatchNorm (#48897)
  • Update is_floating_point() docs to mention bfloat16 (#49611)
  • Improve docs for torch.{scatter,gather} (#49679)
  • Rename "Arguments:" to "Args:" in all doc (#49736)
  • Fix a KaTeX crash and many docstring issues (#49684)
  • Improve torch.flatten doc (#49501)
  • Add note about torch.flip returning a new tensor and not a view (#50041)
  • Add instructional error message for cudnn RNN double backward workaround (#33884)
  • Add centered FFT example to torch.fft.fftshift doc (#51223) (see the sketch after this list)
  • Add torch.sgn to doc (#51479)
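
In the spirit of the fftshift doc addition above, a tiny REPL sketch of a centered spectrum; the input is arbitrary.

>>> import torch
>>> x = torch.rand(8)
>>> centered = torch.fft.fftshift(torch.fft.fft(x))  # zero-frequency bin moved to the center of the spectrum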

Autograd

Complex Number

  • Fix typo in complex autograd docs (#49755)
  • Doc update for complex numbers (#51129, #51661)
  • Document that torch.remainder does not support complex inputs (#48024)

CUDA

  • Add a note on CUDA streams (#45754)
  • Add docs on how to toggle TF32 flags on C++ (#47331)
  • Fix syntax issue in C++ cuda api note (#48434)
  • Change “truncating” to “rounding” in TF32 docs (#49625)
  • Add docstring to torch.cuda.get_device_properties (#49792)
  • Add doc for cuda.memory_fraction and cuda.gpu_process (#51372)

C++ API

  • Add guide for choosing dispatch keys in native_functions.yaml (#46126)
  • Add a few more comments on dispatch key computation methods (#46128)
  • Improve error messages for operator registration API (#47636)
  • Add Math/DefaultBackend to dispatch key guide, introduce PythonDispatcher (#50854)

Distributed

  • Clarify callback behavior when future is completed (#50978)
  • Enhance new_group doc to mention using NCCL concurrently. (#48872)
  • Adding c10d Store API Docs (#45543)
  • Fix distributed documentation for asynchronous collective Work objects (#45709)
  • Fix DDP documentation (#46861)
  • Fix inaccurate note in DistributedDataParallel (#47156)
  • Minor doc fixes for init_process_group (#47644)
  • Docs fixes for HashStore API (#47643)
  • Update links in DDP note (#47663)
  • Small documentation changes for RRef and Dist Autograd (#48123)
  • Add examples for new object-based c10d APIs (#43932)
  • Minor update of the comments on PowerSGD. (#49246)
  • Updating init_process_group docs to indicate correct rank range (#49131)
  • Store Python API Docs Fixes (#49130)
  • Fix link in distributed contributing doc and add link (#49141)
  • Updating Docs to Reflect FileStore changes (#49557)
  • Improve documentation for pipeline parallelism. (#48638)
  • Reorder torch.distributed.rpc.init_rpc docstring arguments (#50419)
  • Add documentation page for pipeline parallelism. (#50791)
  • Update the doc of DistributedOptimizer (#51314)
  • Fix doc inconsistency about callback args in torch.futures.Future (#50979)

TorchScript

  • Added a developer tutorial for tensor expressions - the core technology used in CUDA fuser (#45527)
  • Fix jit model loading example (#48104)
  • Fix archive file extension in examples and docs (#50649)
  • Fix ScriptModule docstring (#48608)
  • Clarify logic in ir_emitter (#51299)

torch.fx
