
Releases: pytorch/pytorch

New TorchScript API with Improved Python Language Coverage, Expanded ONNX Export, NN.Transformer

08 Aug 16:06

We have just released PyTorch v1.2.0.

It has over 1,900 commits and represents a significant amount of work in areas spanning JIT, ONNX, Distributed, Performance, and Eager Frontend improvements.

Highlights

[JIT] New TorchScript API

Version 1.2 includes a new, easier-to-use API for converting nn.Modules into ScriptModules. A sample usage is:

class MyModule(torch.nn.Module):
    ...

# Construct an nn.Module instance
module = MyModule(args)

# Pass it to `torch.jit.script` to compile it into a ScriptModule.
my_torchscript_module = torch.jit.script(module)

torch.jit.script() will attempt to recursively compile the given nn.Module, including any submodules or methods called from forward(). See the migration guide for more info on what's changed and how to migrate.
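
For example, in a sketch like the following (the module and method names are made up for illustration), the submodule and the helper method are compiled along with forward():

import torch

class Inner(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x)

class Outer(torch.nn.Module):
    def __init__(self):
        super(Outer, self).__init__()
        self.inner = Inner()

    def helper(self, x):
        return x + 1

    def forward(self, x):
        # self.inner and self.helper are compiled recursively along with forward()
        return self.inner(self.helper(x))

scripted = torch.jit.script(Outer())
print(scripted(torch.randn(3)))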

[JIT] Improved TorchScript Python language coverage

In 1.2, TorchScript has significantly improved its support for Python language constructs and Python's standard library. Highlights include:

  • Early returns, breaks and continues.
  • Iterator-based constructs, like for..in loops, zip(), and enumerate().
  • NamedTuples.
  • math and string library support.
  • Support for most Python builtin functions.

See the detailed notes below for more information.
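
As a small illustration (the function is hypothetical), the scripted function below relies on enumerate() and an early return, both of which now compile:

from typing import List

import torch

@torch.jit.script
def first_positive(xs: List[int]) -> int:
    # enumerate() and early return are both supported in TorchScript as of 1.2
    for i, x in enumerate(xs):
        if x > 0:
            return i
    return -1

print(first_positive([0, -3, 7, 2]))  # prints 2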

Expanded ONNX Export

In PyTorch 1.2, working with Microsoft, we’ve added full support for exporting ONNX Opset versions 7 (v1.2), 8 (v1.3), 9 (v1.4), and 10 (v1.5). We’ve also enhanced the constant folding pass to support Opset 10, the latest available version of ONNX. Additionally, users are now able to register their own symbolic functions to export custom ops, and to specify the dynamic dimensions of inputs during export. Here is a summary of all of the major improvements:

  • Support for multiple Opsets including the ability to export dropout, slice, flip and interpolate in Opset 10.
  • Improvements to ScriptModule including support for multiple outputs, tensor factories and tuples as inputs and outputs.
  • More than a dozen additional PyTorch operators supported including the ability to export a custom operator.

Updated docs can be found here and also a refreshed tutorial using ONNXRuntime can be found here.
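
As a hedged sketch of the new export options (the torchvision model is only a stand-in; any exportable model works), opset_version selects the target opset and dynamic_axes marks dimensions that may vary at runtime:

import torch
import torchvision

# weights don't matter for the export itself, so skip the pretrained download
model = torchvision.models.resnet18(pretrained=False).eval()
dummy = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model, dummy, "resnet18.onnx",
    opset_version=10,                       # export against ONNX Opset 10
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"},    # batch dimension may vary at runtime
                  "output": {0: "batch"}},
)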

TensorBoard Is No Longer Considered Experimental

Read the documentation or simply type from torch.utils.tensorboard import SummaryWriter to get started!

NN.Transformer

We include a standard nn.Transformer module, based on the paper “Attention is All You Need”. The nn.Transformer module relies entirely on an attention mechanism to draw global dependencies between input and output. The individual components of the nn.Transformer module are designed so they can be adopted independently. For example, the nn.TransformerEncoder can be used by itself, without the larger nn.Transformer. New APIs include:

  • nn.Transformer
  • nn.TransformerEncoder and nn.TransformerEncoderLayer
  • nn.TransformerDecoder and nn.TransformerDecoderLayer

See the Transformer Layers documentation for more info.
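
A minimal usage sketch (shapes are illustrative; in this release inputs are laid out as (sequence length, batch, d_model)):

import torch
import torch.nn as nn

transformer = nn.Transformer(d_model=512, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6)
src = torch.rand(10, 32, 512)   # (source length, batch, d_model)
tgt = torch.rand(20, 32, 512)   # (target length, batch, d_model)
out = transformer(src, tgt)     # shape (20, 32, 512)

# The encoder stack can also be used on its own:
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
memory = encoder(src)           # shape (10, 32, 512)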

Breaking Changes

Comparison operations (lt (<), le (<=), gt (>), ge (>=), eq (==), ne (!=)): the returned dtype has changed from torch.uint8 to torch.bool (21113)

Version 1.1:

>>> torch.tensor([1, 2, 3]) < torch.tensor([3, 1, 2])
tensor([1, 0, 0], dtype=torch.uint8)

Version 1.2:

>>> torch.tensor([1, 2, 3]) < torch.tensor([3, 1, 2])
tensor([True, False, False])

For most programs, we don't expect that any changes will need to be made as a result of this change. There are a couple of possible exceptions listed below.

Mask Inversion

In prior versions of PyTorch, the idiomatic way to invert a mask was to call 1 - mask. This behavior is no longer supported; use the ~ or bitwise_not() operator instead.

Version 1.1:

>>> 1 - (torch.tensor([1, 2, 3]) < torch.tensor([3, 1, 2]))
tensor([0, 1, 1], dtype=torch.uint8)

Version 1.2:

>>> 1 - (torch.tensor([1, 2, 3]) < torch.tensor([3, 1, 2]))
RuntimeError: Subtraction, the `-` operator, with a bool tensor is not supported.
If you are trying to invert a mask, use the `~` or `bitwise_not()` operator instead.

>>> ~(torch.tensor([1, 2, 3]) < torch.tensor([3, 1, 2]))
tensor([False,  True,  True])

sum(Tensor) (python built-in) does not upcast dtype like torch.sum

Python's built-in sum returns results in the same dtype as the tensor itself, so it will not return the expected result if the value of the sum cannot be represented in the dtype of the tensor.

Version 1.1:

# value can be represented in result dtype
>>> sum(torch.tensor([1, 2, 3, 4, 5]) > 2)
tensor(3, dtype=torch.uint8)

# value can NOT be represented in result dtype
>>> sum(torch.ones((300,)) > 0)
tensor(44, dtype=torch.uint8)

# torch.sum properly upcasts result dtype
>>> torch.sum(torch.ones((300,)) > 0)
tensor(300)

Version 1.2:

# value cannot be represented in result dtype (now torch.bool)
>>> sum(torch.tensor([1, 2, 3, 4, 5]) > 2)
tensor(True)

# value cannot be represented in result dtype
>>> sum(torch.ones((300,)) > 0)
tensor(True)

# torch.sum properly upcasts result dtype
>>> torch.sum(torch.ones((300,)) > 0)
tensor(300)

TLDR: use torch.sum instead of the built-in sum. Note that the built-in sum() behavior will more closely resemble torch.sum in the next release.

Note also that masking via torch.uint8 Tensors is now deprecated, see the Deprecations section for more information.

__invert__ / ~: now calls torch.bitwise_not instead of 1 - tensor and is supported for all integral and Boolean dtypes instead of only torch.uint8. (22326)

Version 1.1:

>>> ~torch.arange(8, dtype=torch.uint8)
tensor([ 1, 0, 255, 254, 253, 252, 251, 250], dtype=torch.uint8)

Version 1.2:

>>> ~torch.arange(8, dtype=torch.uint8)
tensor([255, 254, 253, 252, 251, 250, 249, 248], dtype=torch.uint8)

torch.tensor(bool) and torch.as_tensor(bool) now infer torch.bool dtype instead of torch.uint8. (19097)

Version 1.1:

>>> torch.tensor([True, False])
tensor([1, 0], dtype=torch.uint8)

Version 1.2:

>>> torch.tensor([True, False])
tensor([ True, False])

nn.BatchNorm{1,2,3}D: gamma (weight) is now initialized to all 1s rather than randomly initialized from U(0, 1). (13774)

Version 1.1:

>>> torch.nn.BatchNorm2d(5).weight
Parameter containing:
tensor([0.1635, 0.7512, 0.4130, 0.6875, 0.5496], 
       requires_grad=True)

Version 1.2:

>>> torch.nn.BatchNorm2d(5).weight
Parameter containing:
tensor([1., 1., 1., 1., 1.], requires_grad=True)

A number of deprecated Linear Algebra operators have been removed (22841)

Removed               Use Instead
btrifact              lu
btrifact_with_info    lu with get_infos=True
btrisolve             lu_solve
btriunpack            lu_unpack
gesv                  solve
pstrf                 cholesky
potrf                 cholesky
potri                 cholesky_inverse
potrs                 cholesky_solve
trtrs                 triangular_solve

Sparse Tensors: Changing the sparsity of a Tensor through .data is no longer supported. (17072)

>>> x = torch.randn(2,3)
>>> x.data = torch.sparse_coo_tensor((2, 3))
RuntimeError: Attempted to call `variable.set_data(tensor)`,
but `variable` and  `tensor` have incompatible tensor type.

Sparse Tensors: in-place shape modifications of Dense Tensor Constructor Arguments will no longer modify the Sparse Tensor itself (20614)

Version 1.1:

>>> i = torch.tensor([[0, 1]])
>>> v = torch.ones(2)
>>> s = torch.sparse_coo_tensor(i, v)
>>> i.resize_(1, 1)
>>> v.resize_(1)

>>> s.coalesce().indices().shape
torch.Size([1, 1])

>>> s.coalesce().values().shape
torch.Size([1])

Notice indices() and values() reflect the resized tensor shapes.

Version 1.2:

>>> i = torch.tensor([[0, 1]])
>>> v = torch.ones(2)
>>> s = torch.sparse_coo_tensor(i, v)
>>> i.resize_(1, 1)
>>> v.resize_(1)

>>> s.coalesce().indices().shape
torch.Size([1, 2])

>>> s.coalesce().values().shape
torch.Size([2])

Notice indices() and values() reflect the original tensor shapes.

Sparse Tensors: Accumulating dense gradients into a sparse .grad will no longer retain Python object identity. (17072)

Version 1.1:

>>> m = torch.nn.Embedding(10, 3, sparse=True)
>>> m(torch.tensor([[1,2,4,5],[4,3,2,9]])).sum().backward()
>>> assert m.weight.grad.layout == torch.sparse_coo
>>> m_weight_grad_saved = m.weight.grad

# accumulate dense gradient into sparse .grad, change sparsity
>>> m.weigh...

Official TensorBoard Support, Attributes, Dicts, Lists and User-defined types in JIT / TorchScript, Improved Distributed

01 May 00:09

Note: CUDA 8.0 is no longer supported

Highlights

TensorBoard (currently experimental)

First-class and native support for visualization and model debugging with TensorBoard, a web application suite for inspecting and understanding training runs, tensors, and graphs. PyTorch now supports TensorBoard logging with a simple from torch.utils.tensorboard import SummaryWriter command. Histograms, embeddings, scalars, images, text, graphs, and more can be visualized across training runs. TensorBoard support is currently experimental. You can browse the docs here.
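
A minimal logging sketch (the tag and values are made up):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()  # event files go to ./runs/ by default
for step in range(100):
    writer.add_scalar("loss/train", 1.0 / (step + 1), step)
writer.close()

# Then inspect the run with: tensorboard --logdir=runs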

[JIT] Attributes in ScriptModules

Attributes can be assigned on a ScriptModule by wrapping them with torch.jit.Attribute and specifying the type. Attributes are similar to parameters or buffers, but can be of any type. They will be serialized along with any parameters/buffers when you call torch.jit.save(), so they are a great way to store arbitrary state in your model. See the docs for more info.

Example:

import torch
from typing import Dict, List

class Foo(torch.jit.ScriptModule):
  def __init__(self, a_dict):
    super(Foo, self).__init__(False)
    self.words = torch.jit.Attribute([], List[str])
    self.some_dict = torch.jit.Attribute(a_dict, Dict[str, int])

  @torch.jit.script_method
  def forward(self, input: str) -> int:
    self.words.append(input)
    return self.some_dict[input]

[JIT] Dictionary and List Support in TorchScript

TorchScript now has robust support for list and dictionary types. They behave much like Python lists and dictionaries, supporting most built-in methods, as well as simple comprehensions and for…in constructs.
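
For example, a scripted function can build up a typed list with for…in and append(); this is only a sketch, using torch.jit.annotate to give the empty list its element type:

from typing import List

import torch

@torch.jit.script
def scale_all(values: List[float], factor: float) -> List[float]:
    out = torch.jit.annotate(List[float], [])  # empty list typed as List[float]
    for v in values:
        out.append(v * factor)
    return out

print(scale_all([1.0, 2.0, 3.0], 10.0))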

[JIT] User-defined classes in TorchScript (experimental)

For more complex stateful operations, TorchScript now supports annotating a class with @torch.jit.script. Classes used this way can be JIT-compiled and loaded in C++ like other TorchScript modules. See the docs for more info.

@torch.jit.script
class Pair:
    def __init__(self, first, second):
        self.first = first
        self.second = second

    def sum(self):
        return self.first + self.second

DistributedDataParallel new functionality and tutorials

nn.parallel.DistributedDataParallel: can now wrap multi-GPU modules, which enables use cases such as model parallel (tutorial) on one server and data parallel (tutorial) across servers.
(19271).
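
Below is a hedged sketch of wrapping a module that is itself spread across two GPUs. It assumes the process group has already been initialized (e.g. via torch.distributed.init_process_group) and that this process owns cuda:0 and cuda:1; the module is invented for illustration:

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

class TwoStageModel(nn.Module):
    def __init__(self):
        super(TwoStageModel, self).__init__()
        self.stage1 = nn.Linear(10, 10).to('cuda:0')
        self.stage2 = nn.Linear(10, 5).to('cuda:1')

    def forward(self, x):
        x = self.stage1(x.to('cuda:0'))
        return self.stage2(x.to('cuda:1'))

# Leave device_ids unset so DDP treats the wrapped module as multi-device.
ddp_model = DistributedDataParallel(TwoStageModel())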

Breaking Changes

  • Tensor.set_: the device of a Tensor can no longer be changed via Tensor.set_. This would most commonly happen when setting up a Tensor with the default CUDA device and later swapping in a Storage on a different CUDA device. Instead, set up the Tensor on the correct device from the beginning. (18832).
  • Note the order change of lr_scheduler.step(): learning rate schedulers should now be stepped after optimizer.step(). (7889).
  • torch.unique: changed the default value of sorted to True. (15379).
  • [JIT] Renamed the isTensor API to isCompleteTensor. #18437
  • [JIT] Remove GraphExecutor's python bindings. #19141
  • [C++]: many methods on Type no longer exist; use the functional or Tensor method equivalent. (17991).
  • [C++]: the Backend constructor of TensorOptions no longer exists. (18137).
  • [C++, Distributed]: c10d ProcessGroup::getGroupRank has been removed. (19147).

New Features

Operators

  • torch.tril_indices, torch.triu_indices: added operator with same behavior as NumPy. (14904, 15203).
  • torch.combinations, torch.cartesian_prod: added new itertools-like operators. (9393).
  • torch.repeat_interleave: new operator similar to numpy.repeat. (18395).
  • torch.from_file: new operator similar to Storage.from_file, but returning a tensor. (18688).
  • torch.unique_consecutive: new operator with semantics similar to std::unique in C++. (19060).
  • torch.tril, torch.triu, torch.trtrs: now support batching. (15257, 18025).
  • torch.gather: add support for sparse_grad option. (17182).
  • torch.std, torch.max_values, torch.min_values, torch.logsumexp can now operate over multiple dimensions at once. (14535, 15892, 16475).
  • torch.cdist: added operator equivalent to scipy.spatial.distance.cdist. (16168, 17173).
  • torch.__config__.show(): reports detailed version of all libraries. (18579).

NN

  • nn.MultiheadAttention: new module implementing multi-head attention from Attention Is All You Need. (18334).
  • nn.functional.interpolate: added support for bicubic. (9849).
  • nn.SyncBatchNorm: support synchronous Batch Normalization. (14267).
  • nn.Conv: added support for Circular Padding via mode='circular'. (17240).
  • nn.EmbeddingBag: now supports trainable per_sample_weights. (18799).
  • nn.EmbeddingBag: add support for from_pretrained method, as in nn.Embedding. (15273).
  • RNNs: automatically handle unsorted variable-length sequences via enforce_sorted. (15225).
  • nn.Identity: new module for easier model surgery. (19249).

Tensors / dtypes

  • torch.bool: added support for torch.bool dtype and Tensors with that dtype (1-byte storage). NumPy conversion is supported, but operations are currently limited. (16810).

Optim

  • optim.lr_scheduler.CyclicLR: Support for Cyclical Learning Rate and Momentum. (18001).
  • optim.lr_scheduler.CosineAnnealingWarmRestarts: new scheduler implementing Stochastic Gradient Descent with Warm Restarts. (17226).
  • Support multiple simultaneous LR schedulers. (14010)

Distributions

  • torch.distributions: now support multiple inheritance. (16772).

Samplers

  • quasirandom.SobolEngine: new sampler. (10505).

DistributedDataParallel

  • nn.parallel.DistributedDataParallel: now supports modules with unused parameters (e.g. models with control flow, such as adaptive softmax). (18251, 18953).

TorchScript and Tracer

  • Allow early returns from if-statements. (#154463)
  • Add an @ignore annotation, which statically tells the TorchScript compiler to ignore the Python function. (#16055)
  • Simple for...in loops on lists. (#16726)
  • Ellipses (...) in Tensor indexing. (#17763)
  • None in Tensor indexing. (#18615)
  • Support for basic list comprehensions. (#17267)
  • Add implicit unwrapping of optionals on if foo is not None. (#15587)
  • Tensors, ints, and floats will once again be implicitly cast to bool if used in a conditional. (#18755).
  • Implement to(), cpu(), and cuda() on ScriptModules. (#15340 , #15904)
  • Add support for various methods on lists: clear(), pop(), reverse(), copy(), extend(), index(), count(), insert(), remove().
  • Add su...

Bug Fix Release

07 Feb 08:51

Note: our conda install commands have slightly changed. Version specifiers such as cuda100 in conda install pytorch cuda100 -c pytorch have changed to conda install pytorch cudatoolkit=10.0 -c pytorch

Breaking Changes

There are no breaking changes in this release.

Bug Fixes

Serious

  • Higher order gradients for CPU Convolutions have been fixed (regressed in 1.0.0 under MKL-DNN setting) #15686
  • Correct gradients for non-contiguous weights in CPU Convolutions #16301
  • Fix ReLU on CPU Integer Tensors by fixing vec256 inversions #15634
  • Fix bincount for non-contiguous Tensors #15109
  • Fix torch.norm on CPU for large Tensors #15602
  • Fix eq_ to do equality on GPU (was doing greater-equal due to a typo) (#15475)
  • Workaround a CuDNN bug that gave wrong results in certain strided convolution gradient setups
    • blacklist fft algorithms for strided dgrad (#16626)

Correctness

  • Fix cuda native loss_ctc for varying input length (#15798)
    • this avoids NaNs in variable length settings
  • C++ Frontend: Fix serialization (#15033)
    • Fixes a bug where (de-)/serializing a hierarchy of submodules where one submodule doesn't have any parameters, but its submodules do
  • Fix derivative for mvlgamma (#15049)
  • Fix numerical stability in log_prob for Gumbel distribution (#15878)
  • multinomial: fix detection and drawing of zero probability events (#16075)

Crashes

  • PyTorch binaries were crashing on AWS Lambda and a few other niche systems, stemming from CPUInfo handling certain warnings as errors. Updated CPUInfo with relevant fixes.
  • MKL-DNN is now statically built, to avoid conflicts with system versions
  • Allow ReadyQueue to handle empty tasks (#15791)
    • Fixes a segfault with a DataParallel + Checkpoint neural network setting
  • Avoid integer divide by zero error in index_put_ (#14984)
  • Fix for model inference crash on Win10 (#15919) (#16092)
  • Use CUDAGuard when serializing Tensors:
    • Before this change, torch.save and torch.load would initialize the CUDA context on GPU 0 if it hadn't been initialized already, even if the serialized tensors are only on GPU 1.
  • Fix error with handling scalars and rpow, for example 1 ** x, where x is a PyTorch scalar (#16687)
  • Switch to CUDA implementation instead of CuDNN if batch size >= 65536 for affine_grid (#16403)
    • CuDNN crashes when batch size >= 65536
  • [Distributed] TCP init method race condition fix (#15684)
  • [Distributed] Fix a memory leak in Gloo's CPU backend
  • [C++ Frontend] Fix LBFGS issue around using inplace ops (#16167)
  • [Hub] Fix github branch prefix v (#15552)
  • [Hub] url download bugfix for URLs served without Content-Length header

Performance

  • LibTorch binaries now ship with CuDNN enabled. Without this change, many folks saw significant perf differences while using LibTorch vs PyTorch; this should be fixed now. #14976
  • Make btriunpack work for high dimensional batches and faster than before (#15286)
  • improve performance of unique with inverse indices (#16145)
  • Re-enable OpenMP in binaries (got disabled because of a CMake refactor)

Other

  • create type hint stub files for module torch (#16089)
    • This will restore auto-complete functionality in PyCharm, VSCode etc.
  • Fix sum_to behavior with zero dimensions (#15796)
  • Match NumPy by considering NaNs to be larger than any number when sorting (#15886)
  • Fixes various error message / settings in dynamic weight GRU / LSTMs (#15766)
  • C++ Frontend: Make call operator on module holder call forward (#15831)
  • C++ Frontend: Add the normalize transform to the core library (#15891)
  • Fix bug in torch::load and unpack torch::optim::detail namespace (#15926)
  • Implements Batched upper triangular, lower triangular (#15257)
  • Add torch.roll to documentation (#14880)
  • (better errors) Add backend checks for batch norm (#15955)

JIT

  • Add better support for bools in the graph fuser (#15057)
  • Allow tracing with fork/wait (#15184)
  • improve script/no script save error (#15321)
  • Add self to Python printer reserved words (#15318)
  • Better error when torch.load-ing a JIT model (#15578)
  • fix select after chunk op (#15672)
  • Add script standard library documentation + cleanup (#14912)

JIT Compiler, Faster Distributed, C++ Frontend

07 Dec 19:19

Table of Contents

  • Highlights
    • JIT
    • Brand New Distributed Package
    • C++ Frontend [API Unstable]
    • Torch Hub
  • Breaking Changes
  • Additional New Features
    • N-dimensional empty tensors
    • New Operators
    • New Distributions
    • Sparse API Improvements
    • Additions to existing Operators and Distributions
  • Bug Fixes
    • Serious
    • Backwards Compatibility
    • Correctness
    • Error checking
    • Miscellaneous
  • Other Improvements
  • Deprecations
    • CPP Extensions
  • Performance
  • Documentation Improvements

Highlights

JIT

The JIT is a set of compiler tools for bridging the gap between research in PyTorch
and production. It allows for the creation of models that can run without a dependency on the Python interpreter and which can be optimized more aggressively. Using program annotations, existing models can be transformed into Torch Script, a subset of Python that PyTorch can run directly. Model code is still valid Python code and can be debugged with the standard Python toolchain. PyTorch 1.0 provides two ways in which you can make your existing code compatible with the JIT: torch.jit.trace or torch.jit.script. Once annotated, Torch Script code can be aggressively optimized and it can be serialized for later use in our new C++ API, which doesn't depend on Python at all.

# Write in Python, run anywhere!
@torch.jit.script
def RNN(x, h, W_h, U_h, b_h):
  y = []
  for t in range(x.size(0)):
    h = torch.tanh(x[t] @ W_h + h @ U_h + b_h)
    y += [h]
  return torch.stack(y), h

As an example, see a tutorial on deploying a seq2seq model,
loading an exported model from C++, or browse the docs.

Brand New Distributed Package

The torch.distributed package and torch.nn.parallel.DistributedDataParallel module are backed by a brand new re-designed distributed library. The main highlights of the new library are:

  • New torch.distributed is performance driven and operates entirely asynchronously for all backends: Gloo, NCCL, and MPI.
  • Significant Distributed Data Parallel performance improvements, especially for hosts with slower networks such as Ethernet-based hosts
  • Adds async support for all distributed collective operations in the torch.distributed package.
  • Adds the following CPU ops in the Gloo backend: send, recv, reduce, all_gather, gather, scatter
  • Adds barrier op in the NCCL backend
  • Adds new_group support for the NCCL backend
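
A hedged sketch of the new asynchronous collectives; it assumes the usual rendezvous environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are set for each participating process:

import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo", init_method="env://")

t = torch.ones(4)
# async_op=True returns a work handle instead of blocking
work = dist.all_reduce(t, op=dist.ReduceOp.SUM, async_op=True)
# ... overlap other work here ...
work.wait()
print(t)  # each element now equals the world size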

C++ Frontend [API Unstable].

The C++ frontend is a pure C++ interface to the PyTorch backend that follows the API and architecture of the established Python frontend. It is intended to enable research in high performance, low latency and bare metal C++ applications. It provides equivalents to torch.nn, torch.optim, torch.data and other components of the Python frontend. Here is a minimal side-by-side comparison of the two language frontends:

Python:

import torch

model = torch.nn.Linear(5, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
prediction = model.forward(torch.randn(3, 5))
loss = torch.nn.functional.mse_loss(prediction, torch.ones(3, 1))
loss.backward()
optimizer.step()

C++:

#include <torch/torch.h>

torch::nn::Linear model(5, 1);
torch::optim::SGD optimizer(model->parameters(), /*lr=*/0.1);
torch::Tensor prediction = model->forward(torch::randn({3, 5}));
auto loss = torch::mse_loss(prediction, torch::ones({3, 1}));
loss.backward();
optimizer.step();

We are releasing the C++ frontend marked as "API Unstable" as part of PyTorch 1.0. This means it is ready to be used for your research application, but still has some open construction sites that will stabilize over the next couple of releases. Some parts of the API may undergo breaking changes during this time.

See https://pytorch.org/cppdocs for detailed documentation on the greater PyTorch C++ API as well as the C++ frontend.

Torch Hub

Torch Hub is a pre-trained model repository designed to facilitate research reproducibility.

Torch Hub supports publishing pre-trained models (model definitions and pre-trained weights) to a github repository using a simple hubconf.py file; see hubconf for resnet models in pytorch/vision as an example. Once published, users can load the pre-trained models using the torch.hub.load API.

For more details, see the torch.hub documentation. Expect a more-detailed blog post introducing Torch Hub in the near future!
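
A hedged usage sketch, assuming pytorch/vision exposes resnet18 through its hubconf.py:

import torch

# Fetches the repo's hubconf.py and the pretrained weights, then builds the model.
model = torch.hub.load('pytorch/vision', 'resnet18', pretrained=True)
model.eval()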

Breaking Changes

  • Indexing a 0-dimensional tensor will now throw an error instead of warn. Use tensor.item() instead. (#11679).
  • torch.legacy is removed. (#11823).
  • torch.masked_copy_ is removed, use torch.masked_scatter_ instead. (#9817).
  • Operations that result in 0 element tensors may return changed shapes.
    • Before: all 0 element tensors would collapse to shape (0,). For example, torch.nonzero is documented to return a tensor of shape (n, z), where n = number of nonzero elements and z = dimensions of the input, but would always return a Tensor of shape (0,) when no nonzero elements existed.
    • Now: Operations return their documented shape.
      # Previously: all 0-element tensors are collapsed to shape (0,)
      >>> torch.nonzero(torch.zeros(2, 3))
      tensor([], dtype=torch.int64)
      
      # Now, proper shape is returned
      >>> torch.nonzero(torch.zeros(2, 3))
      tensor([], size=(0, 2), dtype=torch.int64)
      
  • Sparse tensor indices and values shape invariants are changed to be more consistent in the case of 0-element tensors. See link for more details. (#9279).
  • torch.distributed: the TCP backend is removed; we recommend using the Gloo and MPI backends for CPU collectives and the NCCL backend for GPU collectives.
  • Some inter-type operations (e.g. *) between torch.Tensors and NumPy arrays will now favor dispatching to the torch variant. This may result in different return types. (#9651).
  • Implicit numpy conversion no longer implicitly moves a tensor to CPU. Therefore, you may have to explicitly move a CUDA tensor to CPU (tensor.to('cpu')) before an implicit conversion. (#10553).
  • torch.randint now defaults to using dtype torch.int64 rather than the default floating-point dtype. (#11040).
  • torch.tensor function with a Tensor argument now returns a detached Tensor (i.e. a Tensor where grad_fn is None). This more closely aligns with the intent of the function, which is to return a Tensor with copied data and no history. (#11061,
    #11815).
  • torch.nn.functional.multilabel_soft_margin_loss now returns Tensors of shape (N,) instead of (N, C) to match the behavior of torch.nn.MultiMarginLoss. In addition, it is more numerically stable.
    (#9965).
  • The result type of a torch.float16 0-dimensional tensor and an integer is now torch.float16 (was torch.float32 or torch.float64 depending on the dtype of the integer). (#11941).
  • Dirichlet and Categorical distributions no longer accept scalar parameters. (#11589).
  • CPP Extensions: Deprecated factory functions that accept a type a...

torch.jit, C++ API, c10d distributed

02 Oct 06:28
Pre-release

This is a pre-release preview; do not rely on the tag to have a fixed set of commits, or on the tag for anything practical or important.

Table of Contents

Highlights

JIT

The JIT is a set of compiler tools for bridging the gap between research in PyTorch
and production. It includes a language called Torch Script (don't worry it is a subset of Python,
so you'll still be writing Python), and two ways in which you can make your existing code compatible with the JIT.
Torch Script code can be aggressively optimized and it can be serialized for later use in our new C++ API, which doesn't depend on Python at all.

# Write in Python, run anywhere!
@torch.jit.script
def RNN(x, h, W_h, U_h, b_h):
  y = []
  for t in range(x.size(0)):
    h = torch.tanh(x[t] @ W_h + h @ U_h + b_h)
    y += [h]
  return torch.stack(y), h

As an example, see a tutorial on deploying a seq2seq model,
loading an exported model from C++, or browse the docs.

torch.distributed new "C10D" library

The torch.distributed package and torch.nn.parallel.DistributedDataParallel module are backed by the new "C10D" library. The main highlights of the new library are:

  • C10D is performance driven and operates entirely asynchronously for all backends: Gloo, NCCL, and MPI.
  • Significant Distributed Data Parallel performance improvements, especially for hosts with slower networks such as Ethernet-based hosts
  • Adds async support for all distributed collective operations in the torch.distributed package.
  • Adds send and recv support in the Gloo backend

C++ Frontend [API Unstable].

The C++ frontend is a pure C++ interface to the PyTorch backend that follows the API and architecture of the established Python frontend. It is intended to enable research in high performance, low latency and bare metal C++ applications. It provides equivalents to torch.nn, torch.optim, torch.data and other components of the Python frontend. Here is a minimal side-by-side comparison of the two language frontends:

Python:

import torch

model = torch.nn.Linear(5, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
prediction = model.forward(torch.randn(3, 5))
loss = torch.nn.functional.mse_loss(prediction, torch.ones(3, 1))
loss.backward()
optimizer.step()

C++:

#include <torch/torch.h>

torch::nn::Linear model(5, 1);
torch::optim::SGD optimizer(model->parameters(), /*lr=*/0.1);
torch::Tensor prediction = model->forward(torch::randn({3, 5}));
auto loss = torch::mse_loss(prediction, torch::ones({3, 1}));
loss.backward();
optimizer.step();

We are releasing the C++ frontend marked as "API Unstable" as part of PyTorch 1.0. This means it is ready to be used for your research application, but still has some open construction sites that will stabilize over the next month or two. Some parts of the API may undergo breaking changes during this time.

See https://pytorch.org/cppdocs for detailed documentation on the greater PyTorch C++ API as well as the C++ frontend.

Breaking Changes

  • Indexing a 0-dimensional tensor will now throw an error instead of warn. Use tensor.item() instead. (#11679).
  • torch.legacy is removed. (#11823).
  • torch.masked_copy_ is removed, use torch.masked_scatter_ instead. (#9817).
  • Operations that result in 0 element tensors may return changed shapes.
    • Before: all 0 element tensors would collapse to shape (0,). For example, torch.nonzero is documented to return a tensor of shape (n, z), where n = number of nonzero elements and z = dimensions of the input, but would always return a Tensor of shape (0,) when no nonzero elements existed.
    • Now: Operations return their documented shape.
      # Previously: all 0-element tensors are collapsed to shape (0,)
      >>> torch.nonzero(torch.zeros(2, 3))
      tensor([], dtype=torch.int64)
      
      # Now, proper shape is returned
      >>> torch.nonzero(torch.zeros(2, 3))
      tensor([], size=(0, 2), dtype=torch.int64)
      
  • Sparse tensor indices and values shape invariants are changed to be more consistent in the case of 0-element tensors. See link for more details. (#9279).
  • torch.distributed: the TCP backend is removed; we recommend using the Gloo and MPI backends for CPU collectives and the NCCL backend for GPU collectives.
  • Some inter-type operations (e.g. *) between torch.Tensors and NumPy arrays will now favor dispatching to the torch variant. This may result in different return types. (#9651).
  • Implicit numpy conversion no longer implicitly moves a tensor to CPU. Therefore, you may have to explicitly move a CUDA tensor to CPU (tensor.to('cpu')) before an implicit conversion. (#10553).
  • torch.randint now defaults to using dtype torch.int64 rather than the default floating-point dtype. (#11040).
  • torch.tensor function with a Tensor argument now returns a detached Tensor (i.e. a Tensor where grad_fn is None). This more closely aligns with the intent of the function, which is to return a Tensor with copied data and no history. (#11061,
    #11815).
  • torch.nn.functional.multilabel_soft_margin_loss now returns Tensors of shape (N,) instead of (N, C) to match the behavior of torch.nn.MultiMarginLoss. In addition, it is more numerically stable.
    (#9965).
  • The result type of a torch.float16 0-dimensional tensor and an integer is now torch.float16 (was torch.float32 or torch.float64 depending on the dtype of the integer). (#11941).
  • Dirichlet and Categorical distributions no longer accept scalar parameters. (#11589).
  • CPP Extensions: Deprecated factory functions that accept a type as the first argument and a size as the second argument have been removed. Instead, use the new-style factory functions that accept the size as the first argument and TensorOptions as the last argument. For example, replace your call to at::ones(torch::CPU(at::kFloat), {2, 3}) with torch::ones({2, 3}, at::kCPU). This applies to the following functions:
    • arange, empty, eye, full, linspace, logspace, ones, rand, randint, randn, randperm, range, zeros.

Additional New Features

N-dimensional empty tensors

  • Tensors with 0 elements can now have an arbitrary number of dimensions and support indexing and other torch operations; previously, 0 element tensors were limited to shape (0,). (#9947). Example:
    >>> torch.empty((0, 2, 4, 0), dtype=torch.float64)
    tensor([], size=(0, 2, 4, 0), dtype=torch.float64)
    

New Operators


Spectral Norm, Adaptive Softmax, faster CPU ops, anomaly detection (NaNs, etc.), Lots of bug fixes, Python 3.7 and CUDA 9.2 support

26 Jul 19:09

Table of Contents

  • Breaking Changes
  • New Features
    • Neural Networks
      • Adaptive Softmax, Spectral Norm, etc.
    • Operators
      • torch.bincount, torch.as_tensor, ...
    • torch.distributions
      • Half Cauchy, Gamma Sampling, ...
    • Other
      • Automatic anomaly detection (detecting NaNs, etc.)
  • Performance
    • Faster CPU ops in a wide variety of cases
  • Other improvements
  • Bug Fixes
  • Documentation Improvements

Breaking Changes

  • torch.stft has changed its signature to be consistent with librosa #9497
    • Before: stft(signal, frame_length, hop, fft_size=None, normalized=False, onesided=True, window=None, pad_end=0)
    • After: stft(input, n_fft, hop_length=None, win_length=None, window=None, center=True, pad_mode='reflect', normalized=False, onesided=True)
    • torch.stft is also now using FFT internally and is much faster.
  • torch.slice is removed in favor of the tensor slicing notation #7924
  • torch.arange now does dtype inference: any floating-point argument is inferred to be the default dtype; all integer arguments are inferred to be int64. #7016
  • torch.nn.functional.embedding_bag's old signature embedding_bag(weight, input, ...) is deprecated, embedding_bag(input, weight, ...) (consistent with torch.nn.functional.embedding) should be used instead
  • torch.nn.functional.sigmoid and torch.nn.functional.tanh are deprecated in favor of torch.sigmoid and torch.tanh #8748
  • Broadcast behavior changed in an (very rare) edge case: [1] x [0] now broadcasts to [0] (used to be [1]) #9209

New Features

Neural Networks

  • Adaptive Softmax nn.AdaptiveLogSoftmaxWithLoss #5287

    >>> in_features = 1000
    >>> n_classes = 200
    >>> adaptive_softmax = nn.AdaptiveLogSoftmaxWithLoss(in_features, n_classes, cutoffs=[20, 100, 150])
    >>> adaptive_softmax
    AdaptiveLogSoftmaxWithLoss(
      (head): Linear(in_features=1000, out_features=23, bias=False)
      (tail): ModuleList(
        (0): Sequential(
          (0): Linear(in_features=1000, out_features=250, bias=False)
          (1): Linear(in_features=250, out_features=80, bias=False)
        )
        (1): Sequential(
          (0): Linear(in_features=1000, out_features=62, bias=False)
          (1): Linear(in_features=62, out_features=50, bias=False)
        )
        (2): Sequential(
          (0): Linear(in_features=1000, out_features=15, bias=False)
          (1): Linear(in_features=15, out_features=50, bias=False)
        )
      )
    )
    >>> batch = 15
    >>> input = torch.randn(batch, in_features)
    >>> target = torch.randint(n_classes, (batch,), dtype=torch.long)
    >>> # get the log probabilities of target given input, and mean negative log probability loss
    >>> adaptive_softmax(input, target) 
    ASMoutput(output=tensor([-6.8270, -7.9465, -7.3479, -6.8511, -7.5613, -7.1154, -2.9478, -6.9885,
            -7.7484, -7.9102, -7.1660, -8.2843, -7.7903, -8.4459, -7.2371],
           grad_fn=<ThAddBackward>), loss=tensor(7.2112, grad_fn=<MeanBackward1>))
    >>> # get the log probabilities of all targets given input as a (batch x n_classes) tensor
    >>> adaptive_softmax.log_prob(input)  
    tensor([[-2.6533, -3.3957, -2.7069,  ..., -6.4749, -5.8867, -6.0611],
            [-3.4209, -3.2695, -2.9728,  ..., -7.6664, -7.5946, -7.9606],
            [-3.6789, -3.6317, -3.2098,  ..., -7.3722, -6.9006, -7.4314],
            ...,
            [-3.3150, -4.0957, -3.4335,  ..., -7.9572, -8.4603, -8.2080],
            [-3.8726, -3.7905, -4.3262,  ..., -8.0031, -7.8754, -8.7971],
            [-3.6082, -3.1969, -3.2719,  ..., -6.9769, -6.3158, -7.0805]],
           grad_fn=<CopySlices>)
    >>> # predict: get the class that maximizes log probability for each input
    >>> adaptive_softmax.predict(input)  
    tensor([ 8,  6,  6, 16, 14, 16, 16,  9,  4,  7,  5,  7,  8, 14,  3])
  • Add spectral normalization nn.utils.spectral_norm #6929

    >>> # Usage is similar to weight_norm
    >>> convT = nn.ConvTranspose2d(3, 64, kernel_size=3, padding=1)
    >>> # Can specify number of power iterations applied each time, or use default (1)
    >>> convT = nn.utils.spectral_norm(convT, n_power_iterations=2)
    >>>
    >>> # apply to every conv and conv transpose module in a model
    >>> def add_sn(m):
        for name, c in m.named_children():
            m.add_module(name, add_sn(c))
        if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
            return nn.utils.spectral_norm(m)
        else:
            return m
    
    >>> my_model = add_sn(my_model)
  • nn.ModuleDict and nn.ParameterDict containers #8463

  • Add nn.init.zeros_ and nn.init.ones_ #7488

  • Add sparse gradient option to pretrained embedding #7492

  • Add max pooling support to nn.EmbeddingBag #5725

  • Depthwise convolution support for MKLDNN #8782

  • Add nn.FeatureAlphaDropout (featurewise Alpha Dropout layer) #9073

Operators


Trade-off memory for compute, Windows support, 24 distributions with cdf, variance etc., dtypes, zero-dimensional Tensors, Tensor-Variable merge, faster distributed, perf and bug fixes, CuDNN 7.1

24 Apr 20:49

PyTorch 0.4.0 release notes

Table of Contents

  • Major Core Changes
    • Tensor / Variable merged
    • Zero-dimensional Tensors
    • dtypes
    • migration guide
  • New Features
    • Tensors
      • Full support for advanced indexing
      • Fast Fourier Transforms
    • Neural Networks
      • Trade-off memory for compute
      • bottleneck - a tool to identify hotspots in your code
    • torch.distributions
      • 24 basic probability distributions
      • Added cdf, variance, entropy, perplexity etc.
    • Distributed Training
      • Launcher utility for ease of use
      • NCCL2 backend
    • C++ Extensions
    • Windows Support
    • ONNX Improvements
      • RNN support
  • Performance improvements
  • Bug fixes

Major Core changes

Here is a summary of the updates to the most important core features users will use daily.

Major Changes and Potentially Breaking Changes:

  • Tensors and Variables have merged
  • Some operations now return 0-dimensional (scalar) Tensors
  • Deprecation of the volatile flag

Improvements:

  • dtypes, devices, and Numpy-style Tensor creation functions added
  • Support for writing device-agnostic code

We wrote a migration guide that should help you transition your code to new APIs and style. Please read it if you have code in a previous version of PyTorch that you would like to migrate.


The contents of this section (Major Core changes) are included in the migration guide.

Merging Tensor and Variable classes

torch.autograd.Variable and torch.Tensor are now the same class. More precisely, torch.Tensor is capable of tracking history and behaves like the old Variable; Variable wrapping continues to work as before but returns an object of type torch.Tensor. This means that you don't need the Variable wrapper everywhere in your code anymore.

The type() of a Tensor has changed

Note also that the type() of a Tensor no longer reflects the data type. Use isinstance() or x.type() instead:

>>> x = torch.DoubleTensor([1, 1, 1])
>>> print(type(x)) # was torch.DoubleTensor
<class 'torch.autograd.variable.Variable'>
>>> print(x.type())  # OK: 'torch.DoubleTensor'
'torch.DoubleTensor'
>>> print(isinstance(x, torch.DoubleTensor))  # OK: True
True

When does autograd start tracking history now?

requires_grad, the central flag for autograd, is now an attribute on Tensors. Let's see how this change manifests in code.

autograd uses the same rules previously used for Variables. It starts tracking history when any input Tensor of an operation has requires_grad=True. For example,

>>> x = torch.ones(1)  # create a tensor with requires_grad=False (default)
>>> x.requires_grad
False
>>> y = torch.ones(1)  # another tensor with requires_grad=False
>>> z = x + y
>>> # both inputs have requires_grad=False. so does the output
>>> z.requires_grad
False
>>> # then autograd won't track this computation. let's verify!
>>> z.backward()
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
>>>
>>> # now create a tensor with requires_grad=True
>>> w = torch.ones(1, requires_grad=True)
>>> w.requires_grad
True
>>> # add to the previous result that has require_grad=False
>>> total = w + z
>>> # the total sum now requires grad!
>>> total.requires_grad
True
>>> # autograd can compute the gradients as well
>>> total.backward()
>>> w.grad
tensor([ 1.])
>>> # and no computation is wasted to compute gradients for x, y and z, which don't require grad
>>> z.grad == x.grad == y.grad == None
True
Manipulating requires_grad flag

Other than directly setting the attribute, you can change this flag in-place using my_tensor.requires_grad_(requires_grad=True), or, as in the above example, at creation time by passing it in as an argument (default is False), e.g.,

>>> existing_tensor.requires_grad_()
>>> existing_tensor.requires_grad
True
>>> my_tensor = torch.zeros(3, 4, requires_grad=True)
>>> my_tensor.requires_grad
True

What about .data?

.data was the primary way to get the underlying Tensor from a Variable. After this merge, calling y = x.data still has similar semantics. So y will be a Tensor that shares the same data with x, is unrelated to the computation history of x, and has requires_grad=False.

However, .data can be unsafe in some cases. Any changes on x.data wouldn't be tracked by autograd, and the computed gradients would be incorrect if x is needed in a backward pass. A safer alternative is to use x.detach(), which also returns a Tensor that shares data with x and has requires_grad=False, but will have its in-place changes reported by autograd if x is needed in backward.
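
A small sketch of the difference (exp() is chosen because its backward pass needs the saved output):

import torch

x = torch.ones(3, requires_grad=True)
y = x.exp()              # exp() saves its output y for the backward pass

z = y.detach()
z.zero_()                # in-place change is visible to autograd via the shared version counter

try:
    y.sum().backward()
except RuntimeError as err:
    # autograd detects that a tensor needed for backward was modified in-place
    print(err)

# With z = y.data instead of y.detach(), the same in-place change would go
# unnoticed and the computed gradients would silently be wrong.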

Some operations now return 0-dimensional (scalar) Tensors

Previously, indexing into a Tensor vector (1-dimensional tensor) gave a Python number, but indexing into a Variable vector gave (inconsistently!) a vector of size (1,)! Similar behavior existed with reduction functions, i.e. tensor.sum() would return a Python number, but variable.sum() would return a vector of size (1,).

Fortunately, this release introduces proper scalar (0-dimensional tensor) support in PyTorch! Scalars can be created using the new torch.tensor function (which will be explained in more detail later; for now just think of it as the PyTorch equivalent of numpy.array). Now you can do things like:

>>> torch.tensor(3.1416)         # create a scalar directly
tensor(3.1416)
>>> torch.tensor(3.1416).size()  # scalar is 0-dimensional
torch.Size([])
>>> torch.tensor([3]).size()     # compare to a vector of size 1
torch.Size([1])
>>>
>>> vector = torch.arange(2, 6)  # this is a vector
>>> vector
tensor([ 2.,  3.,  4.,  5.])
>>> vector.size()
torch.Size([4])
>>> vector[3]                    # indexing into a vector gives a scalar
tensor(5.)
>>> vector[3].item()             # .item() gives the value as a Python number
5.0
>>> sum = torch.tensor([2, 3]).sum()
>>> sum
tensor(5)
>>> sum.size()
torch.Size([])

Accumulating losses

Consider the widely used pattern total_loss += loss.data[0] before 0.4.0. loss was a Variable wrapping a tensor of size (1,), but in 0.4.0 loss is now a scalar and has 0 dimensions. Indexing into a scalar doesn't make sense (it gives a warning now, but will be a hard error in 0.5.0): use loss.item() to get the Python number from a scalar.

Note that if you don't convert to a Python number when accumulating losses, you may find increased memory usage in your program. This is because the right-hand-side of the above expression used to be a Python float, while it is now a zero-dim Tensor. The total loss is thus accumulating Tensors and their gradient history, which may keep around large autograd graphs for much longer than necessary.
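
A small sketch of the recommended pattern (the model and data are made up):

import torch

model = torch.nn.Linear(4, 1)
criterion = torch.nn.MSELoss()

total_loss = 0.0
for _ in range(10):
    inp = torch.randn(8, 4)
    target = torch.randn(8, 1)
    loss = criterion(model(inp), target)
    # loss is a 0-dimensional Tensor; .item() extracts the Python number
    total_loss += loss.item()
print(total_loss)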

Deprecation of volatile flag

The volatile flag is now deprecated and has no effect. Previously, any computation that involved a Variable with volatile=True would not be tracked by autograd. This has now been replaced by a set of more flexible context managers, including torch.no_grad(), torch.set_grad_enabled(grad_mode), and others.

>>> x = torch.zeros(1, requires_grad=True)
>>> with torch.no_grad():
...     y = x * 2
>>> y.requires_grad
False
>>>
>>> is_train = False
>>> with torch.set_grad_enabled(is_train):
...     y = x * 2
>>> y.requires_grad
False
>>> torch.set_grad_enabled(True)  # this can also be used as a function
>>> y = x * 2
>>> y.requires_grad
True
>>> torch.set_grad_enabled(False)
>>> y = x * 2
>>> y.requires_grad
False

dtypes, devices and NumPy-style creation functions

In previous versions of PyTorch, we used to specify the data type (e.g. float vs double), device type (cpu vs cuda) and layout (dense vs sparse) together as a "tensor type". For example, torch.cuda.sparse.DoubleTensor was the Tensor type representing a double data type, living on CUDA devices, and with COO sparse tensor layout.

In this release, we introduce torch.dtype, torch.device...


Bug fixes and performance improvements

14 Feb 00:36

Binaries

  • Removed support for CUDA capability 3.0 and 5.0 (they still work for source builds for now, but the commitment to support them going forward is removed)
  • Stop binary releases for CUDA 7.5
  • Add CPU-only binary releases that are 10x smaller in size than the full binary with CUDA capabilities.

As always, links to our binaries are on http://pytorch.org

New features

Bug Fixes

Data Loader / Datasets / Multiprocessing

  • Made DataLoader workers more verbose on bus error and segfault. Additionally, add a timeout option to the DataLoader, which will error if sample loading time exceeds the given value. #3474
  • DataLoader workers used to all have the same random number generator (RNG) seed because of the semantics of the fork syscall. Now, each worker has its RNG seed set to base_seed + worker_id, where base_seed is a random int64 value generated by the parent process. You may use torch.initial_seed() to access this value in worker_init_fn, which can be used to set other seeds (e.g. NumPy) before data loading; see the sketch after this list. worker_init_fn is an optional argument that will be called on each worker subprocess with the worker id as input, after seeding and before data loading. #4018
  • Add additional signal handling in DataLoader worker processes when workers abruptly die.
  • Negative value for n_workers now gives a ValueError #4019
  • fixed a typo in ConcatDataset.cumulative_sizes attribute name #3534
  • Accept longs in default_collate for dataloader in python 2 #4001
  • Re-initialize autograd engine in child processes #4158
  • Fix distributed dataloader so it pins memory to current GPU not GPU 0. #4196
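
Below is a minimal sketch of seeding NumPy per worker through worker_init_fn, as referenced in the list above; the dataset is a throwaway example:

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def worker_init_fn(worker_id):
    # torch.initial_seed() inside a worker already equals base_seed + worker_id
    np.random.seed(torch.initial_seed() % 2**32)

dataset = TensorDataset(torch.randn(100, 3), torch.zeros(100))
loader = DataLoader(dataset, batch_size=10, num_workers=2,
                    worker_init_fn=worker_init_fn)

for data, target in loader:
    pass  # each worker's NumPy RNG was seeded differently before loading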

CUDA / CuDNN

  • allow cudnn for fp16 batch norm #4021
  • Use enabled argument in torch.autograd.profiler.emit_nvtx (was being ignored) #4032
  • Fix cuBLAS arguments for fp16 torch.dot #3660
  • Fix CUDA index_fill_ boundary check with small tensor size #3953
  • Fix CUDA Multinomial checks #4009
  • Fix CUDA version typo in warning #4175
  • Initialize cuda before setting cuda tensor types as default #4788
  • Add missing lazy_init in cuda python module #4907
  • Lazy init order in set device, should not be called in getDevCount #4918
  • Make torch.cuda.empty_cache() a no-op when cuda is not initialized #4936

CPU

  • Assert MKL ld* conditions for ger, gemm, and gemv #4056

torch operators

  • Fix tensor.repeat when the underlying storage is not owned by torch (for example, coming from numpy) #4084
  • Add proper shape checking to torch.cat #4087
  • Add check for slice shape match in index_copy_ and index_add_. #4342
  • Fix use after free when advanced indexing tensors with tensors #4559
  • Fix triu and tril for zero-strided inputs on gpu #4962
  • Fix blas addmm (gemm) condition check #5048
  • Fix topk work size computation #5053
  • Fix reduction functions to respect the stride of the output #4995
  • Improve float precision stability of linspace op, fix 4419. #4470

autograd

  • Fix python gc race condition with THPVariable_traverse #4437

nn layers

  • Fix padding_idx getting ignored in backward for Embedding(sparse=True) #3842
  • Fix cosine_similarity's output shape #3811
  • Add rnn args check #3925
  • NLLLoss works for arbitrary dimensions #4654
  • More strict shape check on Conv operators #4637
  • Fix maxpool3d / avgpool3d crashes #5052
  • Fix setting using running stats in InstanceNorm*d #4444

Multi-GPU

  • Fix DataParallel scattering for empty lists / dicts / tuples #3769
  • Fix refcycles in DataParallel scatter and gather (fix elevated memory usage) #4988
  • Broadcast output requires_grad only if corresponding input requires_grad #5061

core

  • Remove hard file offset reset in load() #3695
  • Have sizeof account for size of stored elements #3821
  • Fix undefined FileNotFoundError #4384
  • make torch.set_num_threads also set MKL threads (take 2) #5002

others

  • Fix wrong learning rate evaluation in CosineAnnealingLR in Python 2 #4656

Performance improvements

  • slightly simplified math in IndexToOffset #4040
  • improve performance of maxpooling backwards #4106
  • Add cublas batched gemm support. #4151
  • Rearrange dimensions for pointwise operations for better performance. #4174
  • Improve memory access patterns for index operations. #4493
  • Improve CUDA softmax performance #4973
  • Fixed double memory accesses of several pointwise operations. #5068

Documentation and UX Improvements

  • Better error messages for blas ops with cuda.LongTensor #4160
  • Add missing trtrs, orgqr, ormqr docs #3720
  • change doc for Adaptive Pooling #3746
  • Fix MultiLabelMarginLoss docs #3836
  • More docs for Conv1d Conv2d #3870
  • Improve Tensor.scatter_ doc #3937
  • [docs] rnn.py: Note zero defaults for hidden state/cell #3951
  • Improve Tensor.new doc #3954
  • Improve docs for torch and torch.Tensor #3969
  • Added explicit tuple dimensions to doc for Conv1d. #4136
  • Improve svd doc #4155
  • Correct instancenorm input size #4171
  • Fix StepLR example docs #4478

Performance improvements, new layers, ship models to other frameworks (via ONNX), CUDA9, CuDNNv7, lots of bug fixes

05 Dec 01:57

Table of contents

  • Breaking changes: removed reinforce()
  • New features
    • Unreduced losses
    • A profiler for the autograd engine
    • More functions support Higher order gradients
    • New features in Optimizers
    • New layers and nn functionality
    • New Tensor functions and Features
    • Other additions
  • API changes
  • Performance improvements
    • Big reduction in framework overhead (helps small models)
    • 4x to 256x faster Softmax/LogSoftmax
    • More...
  • Framework Interoperability
    • DLPack Interoperability
    • Model Exporter to ONNX (ship PyTorch to Caffe2, CoreML, CNTK, MXNet, Tensorflow)
  • Bug Fixes (a lot of them)

Breaking changes

Stochastic functions, i.e. Variable.reinforce(), were removed because of their limited functionality and broad performance implications. The motivation for stochastic functions was to avoid book-keeping of sampled values. In practice, users were still book-keeping in their code for various reasons. We constructed an alternative, equally effective API, but did not have a reasonable deprecation path to the new API. Hence this removal is a breaking change.

We introduce the torch.distributions package to replace Stochastic functions.

Your previous code typically looked like this:

probs = policy_network(state)
action = probs.multinomial()
next_state, reward = env.step(action)
action.reinforce(reward)
action.backward()

This is the new equivalent code:

probs = policy_network(state)
# NOTE: categorical is equivalent to what used to be called multinomial
m = torch.distributions.Categorical(probs)
action = m.sample()
next_state, reward = env.step(action)
loss = -m.log_prob(action) * reward
loss.backward()

New features

Unreduced losses

Some loss functions can now compute per-sample losses in a mini-batch (see the sketch after the list below).

  • By default PyTorch sums losses over the mini-batch and returns a single scalar loss. This was limiting to users.
  • Now, a subset of loss functions allow specifying reduce=False to return individual losses for each sample in the mini-batch
  • Example: loss = nn.CrossEntropyLoss(..., reduce=False)
  • Currently supported losses: MSELoss, NLLLoss, NLLLoss2d, KLDivLoss, CrossEntropyLoss, SmoothL1Loss, L1Loss
  • More loss functions will be covered in the next release
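
A minimal sketch of the reduce=False option (inputs are still wrapped in Variable in this era; the keyword was later superseded by reduction='none'):

import torch
import torch.nn as nn
from torch.autograd import Variable

criterion = nn.CrossEntropyLoss(reduce=False)
logits = Variable(torch.randn(4, 10))
target = Variable(torch.LongTensor([1, 0, 3, 9]))
per_sample = criterion(logits, target)
print(per_sample.size())   # torch.Size([4]) -- one loss value per sample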

An in-built Profiler in the autograd engine

We built a low-level profiler to help you identify bottlenecks in your models

Let us start with an example:

>>> x = Variable(torch.randn(1, 1), requires_grad=True)
>>> with torch.autograd.profiler.profile() as prof:
...     y = x ** 2
...     y.backward()
>>> # NOTE: some columns were removed for brevity
... print(prof)
--------------------------------  ----------  ---------
Name                               CPU time   CUDA time
-------------------------------   ----------  ---------
PowConstant                        142.036us    0.000us
N5torch8autograd9GraphRootE         63.524us    0.000us
PowConstantBackward                184.228us    0.000us
MulConstant                         50.288us    0.000us
PowConstant                         28.439us    0.000us
Mul                                 20.154us    0.000us
N5torch8autograd14AccumulateGradE   13.790us    0.000us
N5torch8autograd5CloneE              4.088us    0.000us

The profiler works for both CPU and CUDA models.
For CUDA models, you have to run your python program with a special nvprof prefix. For example:

nvprof --profile-from-start off -o trace_name.prof -- python <your arguments>

# in python
>>> with torch.cuda.profiler.profile():
...     model(x) # Warmup CUDA memory allocator and profiler
...     with torch.autograd.profiler.emit_nvtx():
...         model(x)

Then, you can load trace_name.prof in PyTorch and print a summary profile report.

>>> prof = torch.autograd.profiler.load_nvprof('trace_name.prof')
>>> print(prof)

Read additional documentation here

Higher order gradients

Added higher-order gradients support for the following layers

  • ConvTranspose, AvgPool1d, AvgPool2d, LPPool2d, AvgPool3d, MaxPool1d, MaxPool2d, AdaptiveMaxPool, AdaptiveAvgPool, FractionalMaxPool2d, MaxUnpool1d, MaxUnpool2d, nn.Upsample, ReplicationPad2d, ReplicationPad3d, ReflectionPad2d
  • PReLU, HardTanh, L1Loss, SoftSign, ELU, RReLU, Hardshrink, Softplus, SoftShrink, LogSigmoid, Softmin, GLU
  • MSELoss, SmoothL1Loss, KLDivLoss, HingeEmbeddingLoss, SoftMarginLoss, MarginRankingLoss, CrossEntropyLoss
  • DataParallel

Optimizers

  • optim.SparseAdam: Implements a lazy version of Adam algorithm suitable for sparse tensors.
    • In this variant, only moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters.
  • Optimizers now have an add_param_group function that lets you add new parameter groups to an already constructed optimizer.

New layers and nn functionality

  • Added AdaptiveMaxPool3d and AdaptiveAvgPool3d
  • Added LPPool1d
  • F.pad now has support for:
    • 'reflection' and 'replication' padding on 1d, 2d, 3d signals (so 3D, 4D and 5D Tensors)
    • constant padding on n-d signals
  • nn.Upsample now works for 1D signals (i.e. B x C x L Tensors) in nearest and linear modes.
  • grid_sample now allows padding with the border value via padding_mode="border". grid_sample expects a grid in the range of [-1, 1], and if the values are out of these bounds, padding with the value 0.0 is applied by default. However, in a lot of cases, using the border value (i.e. the nearest valid value) helps improve accuracy of the overall model.
  • Introducing nn.utils.parameters_to_vector and nn.utils.vector_to_parameters
    • parameters_to_vector takes net.parameters() and returns a 1D vector that contains all the parameters
    • vector_to_parameters takes a vector of flattened parameters and copies the values over to a network's parameters
    • Convenient for some reinforcement learning algorithms, such as cross-entropy method, TRPO etc., which need to pull all network parameters as one big vector, modify them, and put the modified vector back (see the sketch after this list).
  • Allow users to leave some dimensions of the target output size unspecified for AdaptivePool*d; they are inferred from the input at runtime.
    • For example:
    # target output size of H x 7, where H matches the input height
    m = nn.AdaptiveMaxPool2d((None, 7))
  • DataParallel container on CPU is now a no-op (instead of erroring out)
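
A minimal sketch of parameters_to_vector / vector_to_parameters (the network and perturbation below are made up):

import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

net = nn.Linear(4, 2)

# flatten all parameters into a single 1D vector
flat = parameters_to_vector(net.parameters())

# modify the flat vector, e.g. add a small random perturbation
flat.data.add_(0.01 * torch.randn(flat.data.numel()))

# copy the modified values back into the network's parameters
vector_to_parameters(flat, net.parameters())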

New Tensor functions and features

  • Introduced torch.erf and torch.erfinv that compute the error function and the inverse error function of each element in the Tensor.
  • Added broadcasting support to bitwise operators
  • Added Tensor.put_ and torch.take similar to numpy.take and numpy.put.
    • The take function allows you to linearly index into a tensor without viewing it as a 1D tensor
      first. The output has the same shape as the indices.
    • The put function copies values into a tensor, also using linear indices (see the sketch after this list).
    • Differences from numpy equivalents:
      • numpy.take has an optional axis argument, which behaves like index_select. This axis argument is not yet present.
      • numpy.put repeats the values if necessary to make them as long as indices. This behavior is not yet replicated.
  • Added zeros and zeros_like for sparse Tensors.
  • 1-element Tensors can now be cast to Python scalars. For example: int(torch.Tensor([5])) works now.
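
A minimal sketch of take and put_ with linear indices (the values are made up):

import torch

src = torch.Tensor([[1, 2, 3],
                    [4, 5, 6]])
idx = torch.LongTensor([0, 2, 5])

# take reads at linear indices into the flattened tensor;
# the output has the same shape as idx
out = torch.take(src, idx)                     # -> [1, 3, 6]

# put_ writes values at linear indices, in place
src.put_(idx, torch.Tensor([10, 30, 60]))      # src is now [[10, 2, 30], [4, 5, 60]]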

Other additions

  • Added torch.cuda.get_device_name and torch.cuda.get_device_capability that do what the names say. Example:
    >>> torch.cuda.get_device_name(0)
    'Quadro GP100'
    >>> torch.cuda.get_device_capability(0)
    (6, 0)
  • If one sets torch.backends.cudnn.deterministic = True, then the CuDNN convolutions use deterministic algorithms
  • torch.cuda.get_rng_state_all and torch.cuda.set_rng_state_all are introduced to let you save / load the state of the random number generator over all GPUs at once (see the sketch after this list)
  • torch.cuda.empty_cache() frees the cached memory blocks in PyTorch's caching allocator. This is useful when having long-running IPython notebooks while sharing the GPU with other processes.
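
A minimal sketch of saving and restoring the per-GPU RNG state (e.g. when checkpointing):

import torch

# snapshot the random number generator state of every visible GPU
gpu_rng_states = torch.cuda.get_rng_state_all()

# ... run some CUDA work that consumes random numbers ...

# later, restore exactly the same state, e.g. when resuming from a checkpoint
torch.cuda.set_rng_state_all(gpu_rng_states)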

API changes

  • softmax and log_softmax now take a dim argument that specifies the dimension in which slices are taken for the softmax operation. dim allows negative dimensions as well (dim = -1 will be the last dimension). A short sketch follows at the end of this list.
  • torch.potrf (Cholesky decomposition) is now differentiable and defined on Variable
  • Remove all instances of device_id and replace it with device, to make things consistent
  • torch.autograd.grad now allows you to specify inputs that are unused in the autograd graph if you use allow_unused=True
    This is useful when using torch.autograd.grad in large graphs with lists of inputs / outputs
    For example:
    x, y = Variable(...), Variable(...)
    torch.autograd.grad(x * 2, [x, y]) # errors
    torch.autograd.grad(x * 2, [x, y], allow_unused=True) # works
  • pad_packed_sequence now allows a padding_value argument that can be used instead of zero-padding
  • Dataset now has a + operator (which uses ConcatDataset). You can do something like MNIST(...) + FashionMNIST(...) for example, and you will get a concatenated dataset containing samples from both.
  • torch.distributed.recv allows Tensors to be received from any sender (hence, src is optional). recv returns the...
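
Returning to the dim argument for softmax / log_softmax from the first item above, a minimal sketch (the input values are made up):

import torch
import torch.nn.functional as F
from torch.autograd import Variable

x = Variable(torch.randn(2, 3))

# softmax over the last dimension: each row sums to 1
row_probs = F.softmax(x, dim=-1)

# softmax over dimension 0: each column sums to 1
col_probs = F.softmax(x, dim=0)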

Higher order gradients, Distributed PyTorch, Broadcasting, Advanced Indexing, New Layers and more

28 Aug 14:43

Here comes the next major release of PyTorch, just in time for ICML. Install it today from our website http://pytorch.org
Package documentation for this release is available at http://pytorch.org/docs/0.2.0/

We're introducing long-awaited features such as Broadcasting, Advanced Indexing, Higher-order gradients and finally: Distributed PyTorch.

With the introduction of Broadcasting, the behavior of code in certain broadcastable situations differs from its behavior in 0.1.12. This might lead to silent bugs in your existing code. We've provided easy ways of identifying this ambiguous code in the Important Breakages and Workarounds section.

Table of contents:

  • Tensor Broadcasting (numpy-style)
  • Advanced Indexing for Tensors and Variables
  • Higher-order gradients
  • Distributed PyTorch (multi-node training, etc.)
  • Neural Network layers and features: SpatialTransformers, WeightNorm, EmbeddingBag, etc.
  • New in torch and autograd: matmul, inverse, etc.
  • Easier debugging, better error messages
  • Bug Fixes
  • Important Breakages and Workarounds

Tensor Broadcasting (numpy-style)

In short, if a PyTorch operation supports broadcasting, then its Tensor arguments can be automatically expanded to be of equal sizes (without making copies of the data).

PyTorch Broadcasting semantics closely follow numpy-style broadcasting; if you are familiar with numpy broadcasting, things should just work as expected.

General Semantics

Two tensors are “broadcastable” if the following rules hold:

  • Each tensor has at least one dimension.
  • When iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them must be 1, or one of them must not exist.

For Example:

>>> x=torch.FloatTensor(5,7,3)
>>> y=torch.FloatTensor(5,7,3)
# same shapes are always broadcastable (i.e. the above rules always hold)

# can line up trailing dimensions
>>> x=torch.FloatTensor(5,3,4,1)
>>> y=torch.FloatTensor(  3,1,1)

# x and y are broadcastable.
# 1st trailing dimension: both have size 1
# 2nd trailing dimension: y has size 1
# 3rd trailing dimension: x size == y size
# 4th trailing dimension: y dimension doesn't exist

# but:
>>> x=torch.FloatTensor(5,2,4,1)
>>> y=torch.FloatTensor(  3,1,1)
# x and y are not broadcastable, because in the 3rd trailing dimension 2 != 3

If two tensors x, y are "broadcastable", the resulting tensor size is calculated as follows:

  • If the number of dimensions of x and y are not equal, prepend 1 to the dimensions of the tensor with fewer dimensions to make them equal length.
  • Then, for each dimension size, the resulting dimension size is the max of the sizes of x and y along that dimension.

For Example:

# can line up trailing dimensions to make reading easier
>>> x=torch.FloatTensor(5,1,4,1)
>>> y=torch.FloatTensor(  3,1,1)
>>> (x+y).size()
torch.Size([5, 3, 4, 1])

# error case
>>> x=torch.FloatTensor(5,2,4,1)
>>> y=torch.FloatTensor(  3,1,1)
>>> (x+y).size()
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 1

More details can be found on the PyTorch documentation site. Also, each torch function lists its broadcasting semantics in the documentation.

Advanced Indexing for Tensors and Variables

PyTorch now supports a subset of NumPy style advanced indexing. This allows users to select arbitrary indices at each dimension of the Tensor, including non-adjacent indices and duplicate indices, using the same []-style operation. This allows for a more flexible indexing strategy without needing calls to PyTorch's Index[Select, Add, ...] functions.

Let's look at some examples:

x = torch.Tensor(5, 5, 5)

Pure Integer Array Indexing - specify arbitrary indices at each dimension

x[[1, 2], [3, 2], [1, 0]]
--> yields a 2-element Tensor (x[1][3][1], x[2][2][0])

also supports broadcasting, duplicates

x[[2, 3, 2], [0], [1]]
--> yields a 3-element Tensor (x[2][0][1], x[3][0][1], x[2][0][1])

arbitrary indexer shapes allowed

x[[[1, 0], [0, 1]], [0], [1]].shape
--> yields a 2x2 Tensor [[x[1][0][1], x[0][0][1]],
                         [x[0][0][1], x[1][0][1]]]

can use colon, ellipsis

x[[0, 3], :, :]
x[[0, 3], ...]
--> both yield a 2x5x5 Tensor [x[0], x[3]]

also use Tensors to index!

y = torch.LongTensor([0, 2, 4])
x[y, :, :]
--> yields a 3x5x5 Tensor [x[0], x[2], x[4]]

selection with fewer than ndim indexers; note the use of the comma

x[[1, 3], ]
--> yields a 2x5x5 Tensor [x[1], x[3]]

Higher order gradients

Now you can evaluate higher order differentials in PyTorch. For example, you can compute Hessian-Vector products, penalize the norm of the gradients of your model, implement Unrolled GANs and Improved WGANs, etc.

In the 0.2 release, we've enabled the ability to compute higher order gradients for all torch.XXX functions and the most popular nn layers. The rest will be covered in the next release.

Here's a short example that penalizes the norm of the weight gradients of a Resnet-18 model, so that the weights change slowly.

import torch
from torchvision.models import resnet18
from torch.autograd import Variable

model = resnet18().cuda()

# an optimizer is needed for the final step below; SGD here is just an example choice
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# dummy inputs for the example
input = Variable(torch.randn(2,3,224,224).cuda(), requires_grad=True)
target = Variable(torch.zeros(2).long().cuda())

# as usual
output = model(input)
loss = torch.nn.functional.nll_loss(output, target)

grad_params = torch.autograd.grad(loss, model.parameters(), create_graph=True)
# torch.autograd.grad does not accumulate the gradients into the .grad attributes
# It instead returns the gradients as Variable tuples.

# now compute the 2-norm of the grad_params
grad_norm = 0
for grad in grad_params:
    grad_norm += grad.pow(2).sum()
grad_norm = grad_norm.sqrt()

# take the gradients wrt grad_norm. backward() will accumulate
# the gradients into the .grad attributes
grad_norm.backward()

# do an optimization step
optimizer.step()

We see two new concepts here:

  1. torch.autograd.grad is a function that takes in [outputs, list of inputs (for which you want gradients)], and returns the gradients wrt. these inputs as a tuple, rather than accumulating the gradients into the .grad attributes. This is useful if you want to further operate on the gradients.
  2. You can operate on the gradients, and call backward() on them.

The nn layers that support higher order gradients are:

  • AvgPool*d, BatchNorm*d, Conv*d, MaxPool1d, MaxPool2d, Linear, Bilinear
  • pad, ConstantPad2d, ZeroPad2d, LPPool2d, PixelShuffle
  • ReLU6, LeakyReLU, PReLU, Tanh, Tanhshrink, Threshold, Sigmoid, HardTanh, ELU, Softsign, SeLU
  • L1Loss, NLLLoss, PoissonNLLLoss, LogSoftmax, Softmax2d
    The rest will be enabled in the next release.

To enable higher order gradients, we've introduced a new style of writing autograd.Function (the current/old style of writing functions is fully backward compatible). You can read more about the new style of functions here.

Most of you don't write your own autograd.Functions; they are low-level primitives that introduce new operations to the autograd engine, in which you specify the forward and backward calls.
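
A minimal sketch of the new static-method style, using a made-up MulConstant op:

import torch
from torch.autograd import Function, Variable

class MulConstant(Function):
    # new style: forward/backward are static methods, and any state needed
    # for the backward pass is stashed on the `ctx` object
    @staticmethod
    def forward(ctx, tensor, constant):
        ctx.constant = constant
        return tensor * constant

    @staticmethod
    def backward(ctx, grad_output):
        # one return value per forward input: a gradient for `tensor`,
        # and None for the non-differentiable `constant`
        return grad_output * ctx.constant, None

x = Variable(torch.randn(3), requires_grad=True)
y = MulConstant.apply(x, 2.0)
y.sum().backward()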

Distributed PyTorch

We introduce the torch.distributed package that allows you to exchange Tensors among multiple machines. Using this package, you can scale your network training over multiple machines and larger mini-batches. For example, you are given the primitives to implement Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.

The distributed package follows an MPI-style programming model. This means that there are functions provided to you such as send, recv, all_reduce that will exchange Tensors among nodes (machines).

So that the machines can first identify each other and assign unique numbers (ranks) to each process, we provide simple initialization methods:

  • shared file system (requires that all processes can access a single file system)
  • IP multicast (requires that all processes are in the same network)
  • environment variable (requires you to manually assign ranks and know an address of a node reachable from all processes)

Our package documentation contains more details on initialization and available backends, but here's an example of initializing using a multicast address:

import torch.distributed as dist

dist.init_process_group(backend='tcp',
                        init_method='tcp://[ff15:1e18:5d4c:4cf0:d02d:b659:53ba:b0a7]:23456',
                        world_size=4)

print('Hello from process {} (out of {})!'.format(
        dist.get_rank(), dist.get_world_size()))

This would print Hello from process 2 (out of 4) on the 3rd machine.

World size is the number of processes that will participate in the job. Each will be assigned a rank, which is a number between 0 and world_size - 1, unique within this job. It will serve as a process identifier and will be used instead of an address to, for example, specify to which process a tensor should be sent.
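
As an additional minimal sketch of a collective operation (assuming init_process_group has already been called as above):

import torch
import torch.distributed as dist

# every process contributes a tensor filled with its own rank ...
t = torch.ones(3) * dist.get_rank()

# ... and all_reduce sums it across all processes (SUM is the default op),
# so afterwards every rank holds 0 + 1 + ... + (world_size - 1) in each entry
dist.all_reduce(t)

print('rank {} now holds {}'.format(dist.get_rank(), t.tolist()))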

Here's a snippet that shows how simple point-to-point communication can be performed:

# All proces...