Releases: pytorch/pytorch
New TorchScript API with Improved Python Language Coverage, Expanded ONNX Export, NN.Transformer
We have just released PyTorch v1.2.0.
It has over 1,900 commits and contains a significant amount of effort in areas spanning JIT, ONNX, Distributed, as well as Performance and Eager Frontend Improvements.
Highlights
[JIT] New TorchScript API
Version 1.2 includes a new, easier-to-use API for converting nn.Modules into ScriptModules. A sample usage is:
class MyModule(torch.nn.Module):
    ...

# Construct an nn.Module instance
module = MyModule(args)

# Pass it to `torch.jit.script` to compile it into a ScriptModule.
my_torchscript_module = torch.jit.script(module)
torch.jit.script() will attempt to recursively compile the given nn.Module, including any submodules or methods called from forward(). See the migration guide for more info on what's changed and how to migrate.
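The recursive behaviour described above can be illustrated with a small sketch. The module names (Gate, MyModule), the scale() helper and the layer sizes are hypothetical and chosen only for illustration; the point is that a single torch.jit.script call also compiles the submodule and the helper method that forward() reaches.

```python
import torch
import torch.nn as nn


class Gate(nn.Module):
    """Hypothetical submodule; compiled because the parent's forward() calls it."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Data-dependent control flow is kept by scripting.
        if bool(x.sum() > 0):
            return x
        return -x


class MyModule(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.gate = Gate()

    def scale(self, x: torch.Tensor) -> torch.Tensor:
        # Helper method reached from forward(), so it is compiled as well.
        return 2.0 * x

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gate(self.scale(self.linear(x)))


module = MyModule(4, 2)
scripted = torch.jit.script(module)  # one call compiles module, submodule and helper

x = torch.randn(3, 4)
eager_out = module(x)
scripted_out = scripted(x)
```

The eager and scripted modules share the same parameters, so their outputs should agree on the same input.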
[JIT] Improved TorchScript Python language coverage
In 1.2, TorchScript has significantly improved its support for Python language constructs and Python's standard library. Highlights include:
- Early returns, breaks and continues.
- Iterator-based constructs, like for..in loops, zip(), and enumerate().
- NamedTuples.
- math and string library support.
- Support for most Python builtin functions.
See the detailed notes below for more information.
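As a rough sketch of the constructs listed above, the two functions below use enumerate(), zip(), an early return, continue, break and the math library inside scripted code. The function names and inputs are made up for this example.

```python
import math
from typing import List

import torch


@torch.jit.script
def first_negative_index(values: List[float]) -> int:
    # enumerate() plus an early return from inside the loop
    for i, v in enumerate(values):
        if v < 0.0:
            return i
    return -1


@torch.jit.script
def weighted_sum_of_roots(xs: List[float], ws: List[float]) -> float:
    total = 0.0
    # zip() over two lists, with continue and break
    for x, w in zip(xs, ws):
        if x < 0.0:
            continue  # skip values that have no real square root
        if w == 0.0:
            break  # stop at the first zero weight
        total += w * math.sqrt(x)
    return total


idx = first_negative_index([1.0, 2.0, -3.0])
total = weighted_sum_of_roots([4.0, -1.0, 9.0, 16.0], [1.0, 5.0, 2.0, 0.0])
```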
Expanded ONNX Export
In PyTorch 1.2, working with Microsoft, we’ve added full support to export ONNX Opset versions 7 (v1.2), 8 (v1.3), 9 (v1.4) and 10 (v1.5). We have also enhanced the constant folding pass to support Opset 10, the latest available version of ONNX. Additionally, users are now able to register their own symbolic to export custom ops, and specify the dynamic dimensions of inputs during export. Here is a summary of all of the major improvements:
- Support for multiple Opsets including the ability to export dropout, slice, flip and interpolate in Opset 10.
- Improvements to ScriptModule including support for multiple outputs, tensor factories and tuples as inputs and outputs.
- More than a dozen additional PyTorch operators supported including the ability to export a custom operator.
Updated docs can be found here and also a refreshed tutorial using ONNXRuntime can be found here.
TensorBoard is No Longer Considered Experimental
Read the documentation or simply type from torch.utils.tensorboard import SummaryWriter to get started!
NN.Transformer
We include a standard nn.Transformer module, based on the paper “Attention is All You Need”. The nn.Transformer module relies entirely on an attention mechanism to draw global dependencies between input and output. The individual components of the nn.Transformer module are designed so they can be adopted independently. For example, the nn.TransformerEncoder can be used by itself, without the larger nn.Transformer. New APIs include:
- nn.Transformer
- nn.TransformerEncoder and nn.TransformerEncoderLayer
- nn.TransformerDecoder and nn.TransformerDecoderLayer
See the Transformer Layers documentation for more info.
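To illustrate using the encoder on its own, here is a minimal sketch. The sizes (d_model, nhead, number of layers, sequence length, batch size) are arbitrary values picked for the example, not recommendations from the release notes.

```python
import torch
import torch.nn as nn

d_model, nhead, num_layers = 16, 4, 2  # illustrative sizes only

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, dim_feedforward=32)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
encoder.eval()  # disable dropout so repeated calls give the same output

# Default input layout is (sequence length, batch size, d_model).
src = torch.randn(10, 3, d_model)
with torch.no_grad():
    out = encoder(src)
```

The encoder keeps the input shape, so it can be dropped in front of any head that expects (sequence, batch, feature) tensors.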
Breaking Changes
Comparison operations (lt (<), le (<=), gt (>), ge (>=), eq (==), ne (!=)) return dtype has changed from torch.uint8 to torch.bool (21113)
Version 1.1:
>>> torch.tensor([1, 2, 3]) < torch.tensor([3, 1, 2])
tensor([1, 0, 0], dtype=torch.uint8)
Version 1.2:
>>> torch.tensor([1, 2, 3]) < torch.tensor([3, 1, 2])
tensor([True, False, False])
For most programs, we don't expect that any changes will need to be made as a result of this change. There are a couple of possible exceptions listed below.
Mask Inversion
In prior versions of PyTorch, the idiomatic way to invert a mask was to call 1 - mask. This behavior is no longer supported; use the ~ or bitwise_not() operator instead.
Version 1.1:
>>> 1 - (torch.tensor([1, 2, 3]) < torch.tensor([3, 1, 2]))
tensor([0, 1, 1], dtype=torch.uint8)
Version 1.2:
>>> 1 - (torch.tensor([1, 2, 3]) < torch.tensor([3, 1, 2]))
RuntimeError: Subtraction, the `-` operator, with a bool tensor is not supported.
If you are trying to invert a mask, use the `~` or `bitwise_not()` operator instead.
>>> ~(torch.tensor([1, 2, 3]) < torch.tensor([3, 1, 2]))
tensor([False, True, True])
sum(Tensor) (python built-in) does not upcast dtype like torch.sum
Python's built-in sum returns results in the same dtype as the tensor itself, so it will not return the expected result if the value of the sum cannot be represented in the dtype of the tensor.
Version 1.1:
# value can be represented in result dtype
>>> sum(torch.tensor([1, 2, 3, 4, 5]) > 2)
tensor(3, dtype=torch.uint8)
# value can NOT be represented in result dtype
>>> sum(torch.ones((300,)) > 0)
tensor(44, dtype=torch.uint8)
# torch.sum properly upcasts result dtype
>>> torch.sum(torch.ones((300,)) > 0)
tensor(300)
Version 1.2:
# value cannot be represented in result dtype (now torch.bool)
>>> sum(torch.tensor([1, 2, 3, 4, 5]) > 2)
tensor(True)
# value cannot be represented in result dtype
>>> sum(torch.ones((300,)) > 0)
tensor(True)
# torch.sum properly upcasts result dtype
>>> torch.sum(torch.ones((300,)) > 0)
tensor(300)
TLDR: use torch.sum instead of the built-in sum. Note that the built-in sum() behavior will more closely resemble torch.sum in the next release.
Note also that masking via torch.uint8 Tensors is now deprecated, see the Deprecations section for more information.
__invert__ / ~: now calls torch.bitwise_not instead of 1 - tensor and is supported for all integral+Boolean dtypes instead of only torch.uint8. (22326)
Version 1.1:
>>> ~torch.arange(8, dtype=torch.uint8)
tensor([ 1, 0, 255, 254, 253, 252, 251, 250], dtype=torch.uint8)
Version 1.2:
>>> ~torch.arange(8, dtype=torch.uint8)
tensor([255, 254, 253, 252, 251, 250, 249, 248], dtype=torch.uint8)
torch.tensor(bool) and torch.as_tensor(bool) now infer torch.bool dtype instead of torch.uint8. (19097)
Version 1.1:
>>> torch.tensor([True, False])
tensor([1, 0], dtype=torch.uint8)
Version 1.2:
>>> torch.tensor([True, False])
tensor([ True, False])
nn.BatchNorm{1,2,3}D: gamma (weight) is now initialized to all 1s rather than randomly initialized from U(0, 1). (13774)
Version 1.1:
>>> torch.nn.BatchNorm2d(5).weight
Parameter containing:
tensor([0.1635, 0.7512, 0.4130, 0.6875, 0.5496],
requires_grad=True)
Version 1.2:
>>> torch.nn.BatchNorm2d(5).weight
Parameter containing:
tensor([1., 1., 1., 1., 1.], requires_grad=True)
A number of deprecated Linear Algebra operators have been removed (22841)
Removed | Use Instead |
---|---|
btrifact | lu |
btrifact_with_info | lu with get_infos=True |
btrisolve | lu_solve |
btriunpack | lu_unpack |
gesv | solve |
pstrf | cholesky |
potrf | cholesky |
potri | cholesky_inverse |
potrs | cholesky_solve |
trtrs | triangular_solve |
Sparse Tensors: Changing the sparsity of a Tensor through .data is no longer supported. (17072)
>>> x = torch.randn(2,3)
>>> x.data = torch.sparse_coo_tensor((2, 3))
RuntimeError: Attempted to call `variable.set_data(tensor)`,
but `variable` and `tensor` have incompatible tensor type.
Sparse Tensors: in-place shape modifications of Dense Tensor Constructor Arguments will no longer modify the Sparse Tensor itself (20614)
Version 1.1:
>>> i = torch.tensor([[0, 1]])
>>> v = torch.ones(2)
>>> s = torch.sparse_coo_tensor(i, v)
>>> i.resize_(1, 1)
>>> v.resize_(1)
>>> s.coalesce().indices().shape
torch.Size([1, 1])
>>> s.coalesce().values().shape
torch.Size([1])
Notice indices() and values() reflect the resized tensor shapes.
Version 1.2:
>>> i = torch.tensor([[0, 1]])
>>> v = torch.ones(2)
>>> s = torch.sparse_coo_tensor(i, v)
>>> i.resize_(1, 1)
>>> v.resize_(1)
>>> s.coalesce().indices().shape
torch.Size([1, 2])
>>> s.coalesce().values().shape
torch.Size([2])
Notice indices() and values() reflect the original tensor shapes.
Sparse Tensors: Accumulating dense gradients into a sparse .grad will no longer retain Python object identity. (17072)
Version 1.1:
>>> m = torch.nn.Embedding(10, 3, sparse=True)
>>> m(torch.tensor([[1,2,4,5],[4,3,2,9]])).sum().backward()
>>> assert m.weight.grad.layout == torch.sparse_coo
>>> m_weight_grad_saved = m.weight.grad
# accumulate dense gradient into sparse .grad, change sparsity
>>> m.weigh...
Official TensorBoard Support, Attributes, Dicts, Lists and User-defined types in JIT / TorchScript, Improved Distributed
Note: CUDA 8.0 is no longer supported
Highlights
TensorBoard (currently experimental)
First-class and native support for visualization and model debugging with TensorBoard, a web application suite for inspecting and understanding training runs, tensors, and graphs. PyTorch now supports TensorBoard logging with a simple from torch.utils.tensorboard import SummaryWriter command. Histograms, embeddings, scalars, images, text, graphs, and more can be visualized across training runs. TensorBoard support is currently experimental. You can browse the docs here.
[JIT] Attributes in ScriptModules
Attributes can be assigned on a ScriptModule by wrapping them with torch.jit.Attribute and specifying the type. Attributes are similar to parameters or buffers, but can be of any type. They will be serialized along with any parameters/buffers when you call torch.jit.save(), so they are a great way to store arbitrary state in your model. See the docs for more info.
Example:
class Foo(torch.jit.ScriptModule):
    def __init__(self, a_dict):
        super(Foo, self).__init__(False)
        self.words = torch.jit.Attribute([], List[str])
        self.some_dict = torch.jit.Attribute(a_dict, Dict[str, int])

    @torch.jit.script_method
    def forward(self, input: str) -> int:
        self.words.append(input)
        return self.some_dict[input]
[JIT] Dictionary and List Support in TorchScript
TorchScript now has robust support for list and dictionary types. They behave much like Python lists and dictionaries, supporting most built-in methods, as well as simple comprehensions and for…in constructs.
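A short sketch of what this looks like in practice follows. The function names are invented for the example, and torch.jit.annotate is used to give the empty dictionary an explicit type, since an empty literal carries no element type on its own.

```python
from typing import Dict, List

import torch


@torch.jit.script
def count_words(words: List[str]) -> Dict[str, int]:
    # An empty dict needs an explicit type in TorchScript.
    counts = torch.jit.annotate(Dict[str, int], {})
    for w in words:
        if w in counts:
            counts[w] = counts[w] + 1
        else:
            counts[w] = 1
    return counts


@torch.jit.script
def squares(xs: List[int]) -> List[int]:
    # Simple list comprehension over a list
    return [x * x for x in xs]


word_counts = count_words(["a", "b", "a"])
squared = squares([1, 2, 3])
```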
[JIT] User-defined classes in TorchScript (experimental)
For more complex stateful operations, TorchScript now supports annotating a class with @torch.jit.script. Classes used this way can be JIT-compiled and loaded in C++ like other TorchScript modules. See the docs for more info.
@torch.jit.script
class Pair:
    def __init__(self, first, second):
        self.first = first
        self.second = second

    def sum(self):
        return self.first + self.second
DistributedDataParallel new functionality and tutorials
nn.parallel.DistributedDataParallel: can now wrap multi-GPU modules, which enables use cases such as model parallel (tutorial) on one server and data parallel (tutorial) across servers. (19271).
Breaking Changes
- Tensor.set_: the device of a Tensor can no longer be changed via Tensor.set_. This would most commonly happen when setting up a Tensor with the default CUDA device and later swapping in a Storage on a different CUDA device. Instead, set up the Tensor on the correct device from the beginning. (18832).
- Pay attention to the order change of lr_scheduler.step(). (7889).
- torch.unique: changed the default value of sorted to True. (15379).
- [JIT] Rename isTensor api -> isCompleteTensor. #18437
- [JIT] Remove GraphExecutor's python bindings. #19141
- [C++]: many methods on Type no longer exist; use the functional or Tensor method equivalent. (17991).
- [C++]: the Backend constructor of TensorOptions no longer exists. (18137).
- [C++, Distributed]: c10d ProcessGroup::getGroupRank has been removed. (19147).
New Features
Operators
- torch.tril_indices, torch.triu_indices: added operator with same behavior as NumPy. (14904, 15203).
- torch.combinations, torch.cartesian_prod: added new itertools-like operators. (9393).
- torch.repeat_interleave: new operator similar to numpy.repeat. (18395).
- torch.from_file: new operator similar to Storage.from_file, but returning a tensor. (18688).
- torch.unique_consecutive: new operator with semantics similar to std::unique in C++. (19060).
- torch.tril, torch.triu, torch.trtrs: now support batching. (15257, 18025).
- torch.gather: add support for sparse_grad option. (17182).
- torch.std, torch.max_values, torch.min_values, torch.logsumexp can now operate over multiple dimensions at once. (14535, 15892, 16475).
- torch.cdist: added operator equivalent to scipy.spatial.distance.cdist. (16168, 17173).
- torch.__config__.show(): reports detailed version of all libraries. (18579).
NN
- nn.MultiheadAttention: new module implementing multi-head attention from Attention Is All You Need. (18334).
- nn.functional.interpolate: added support for bicubic. (9849).
- nn.SyncBatchNorm: support synchronous Batch Normalization. (14267).
- nn.Conv: added support for Circular Padding via mode='circular'. (17240).
- nn.EmbeddingBag: now supports trainable per_sample_weights. (18799).
- nn.EmbeddingBag: add support for from_pretrained method, as in nn.Embedding. (15273).
- RNNs: automatically handle unsorted variable-length sequences via enforce_sorted. (15225).
- nn.Identity: new module for easier model surgery. (19249).
Tensors / dtypes
- torch.bool: added support for torch.bool dtype and Tensors with that dtype (1-byte storage). NumPy conversion is supported, but operations are currently limited. (16810).
Optim
- optim.lr_scheduler.CyclicLR: Support for Cyclical Learning Rate and Momentum. (18001).
- optim.lr_scheduler.CosineAnnealingWarmRestarts: new scheduler: Stochastic Gradient Descent with Warm Restarts. (17226).
- Support multiple simultaneous LR schedulers. (14010)
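As a sketch of the cyclical schedule, the loop below records the learning rate over one full cycle. The model and the base_lr, max_lr and step_size_up values are arbitrary choices for the example. No gradients are computed, so optimizer.step() does not change the parameters; it is called only to keep the optimizer-before-scheduler order.

```python
import torch

model = torch.nn.Linear(2, 1)
# CyclicLR cycles momentum by default, so the optimizer needs a momentum setting.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=0.01, max_lr=0.1, step_size_up=4
)

lrs = []
for _ in range(8):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()  # optimizer first ...
    scheduler.step()  # ... then the scheduler

# The rate climbs from base_lr to max_lr over 4 steps, then descends again.
```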
Distributions
- torch.distributions: now support multiple inheritance. (16772).
Samplers
- quasirandom.SobolEngine: new sampler. (10505).
DistributedDataParallel
- nn.parallel.DistributedDataParallel: now supports modules with unused parameters (e.g. control flow, like adaptive softmax, etc). (18251, 18953).
TorchScript and Tracer
- Allow early returns from if-statements. (#154463)
- Add an @ignore annotation, which statically tells the TorchScript compiler to ignore the Python function. (#16055)
- Simple for...in loops on lists. (#16726)
- Ellipses (...) in Tensor indexing. (#17763)
- None in Tensor indexing. (#18615)
- Support for basic list comprehensions. (#17267)
- Add implicit unwrapping of optionals on if foo is not None. (#15587)
- Tensors, ints, and floats will once again be implicitly cast to bool if used in a conditional. (#18755).
- Implement to(), cpu(), and cuda() on ScriptModules. (#15340, #15904)
- Add support for various methods on lists: (clear(), pop(), reverse(), copy(), extend(), index(), count(), insert(), remove()).
- Add su...
Bug Fix Release
Note: our conda install commands have slightly changed. Version specifiers such as cuda100 in conda install pytorch cuda100 -c pytorch have changed to conda install pytorch cudatoolkit=10.0 -c pytorch
Breaking Changes
There are no breaking changes in this release.
Bug Fixes
Serious
- Higher order gradients for CPU Convolutions have been fixed (regressed in 1.0.0 under MKL-DNN setting) #15686
- Correct gradients for non-contiguous weights in CPU Convolutions #16301
- Fix ReLU on CPU Integer Tensors by fixing vec256 inversions #15634
- Fix bincount for non-contiguous Tensors #15109
- Fix torch.norm on CPU for large Tensors #15602
- Fix eq_ to do equality on GPU (was doing greater-equal due to a typo) (#15475)
- Workaround a CuDNN bug that gave wrong results in certain strided convolution gradient setups
  - blacklist fft algorithms for strided dgrad (#16626)
Correctness
- Fix cuda native loss_ctc for varying input length (#15798)
  - this avoids NaNs in variable length settings
- C++ Frontend: Fix serialization (#15033)
  - Fixes a bug where (de-)/serializing a hierarchy of submodules where one submodule doesn't have any parameters, but its submodules do
- Fix derivative for mvlgamma (#15049)
- Fix numerical stability in log_prob for Gumbel distribution (#15878)
- multinomial: fix detection and drawing of zero probability events (#16075)
Crashes
- PyTorch binaries were crashing on AWS Lambda and a few other niche systems, stemming from CPUInfo handling certain warnings as errors. Updated CPUInfo with relevant fixes.
- MKL-DNN is now statically built, to avoid conflicts with system versions
- Allow ReadyQueue to handle empty tasks (#15791)
  - Fixes a segfault with a DataParallel + Checkpoint neural network setting
- Avoid integer divide by zero error in index_put_ (#14984)
- Fix for model inference crash on Win10 (#15919) (#16092)
- Use CUDAGuard when serializing Tensors:
  - Before this change, torch.save and torch.load would initialize the CUDA context on GPU 0 if it hadn't been initialized already, even if the serialized tensors are only on GPU 1.
- Fix error with handling scalars and rpow, for example 1 ** x, where x is a PyTorch scalar (#16687)
- Switch to CUDA implementation instead of CuDNN if batch size >= 65536 for affine_grid (#16403)
  - CuDNN crashes when batch size >= 65536
- [Distributed] TCP init method race condition fix (#15684)
- [Distributed] Fix a memory leak in Gloo's CPU backend
- [C++ Frontend] Fix LBFGS issue around using inplace ops (#16167)
- [Hub] Fix github branch prefix v (#15552)
- [Hub] url download bugfix for URLs served without Content-Length header
Performance
- LibTorch binaries now ship with CuDNN enabled. Without this change, many folks saw significant perf differences while using LibTorch vs PyTorch, this should be fixed now. #14976
- Make btriunpack work for high dimensional batches and faster than before (#15286)
- improve performance of unique with inverse indices (#16145)
- Re-enable OpenMP in binaries (got disabled because of a CMake refactor)
Other
- create type hint stub files for module torch (#16089)
  - This will restore auto-complete functionality in PyCharm, VSCode etc.
- Fix sum_to behavior with zero dimensions (#15796)
- Match NumPy by considering NaNs to be larger than any number when sorting (#15886)
- Fixes various error message / settings in dynamic weight GRU / LSTMs (#15766)
- C++ Frontend: Make call operator on module holder call forward (#15831)
- C++ Frontend: Add the normalize transform to the core library (#15891)
- Fix bug in torch::load and unpack torch::optim::detail namespace (#15926)
- Implements Batched upper triangular, lower triangular (#15257)
- Add torch.roll to documentation (#14880)
- (better errors) Add backend checks for batch norm (#15955)
JIT
- Add better support for bools in the graph fuser (#15057)
- Allow tracing with fork/wait (#15184)
- improve script/no script save error (#15321)
- Add self to Python printer reserved words (#15318)
- Better error when torch.load-ing a JIT model (#15578)
- fix select after chunk op (#15672)
- Add script standard library documentation + cleanup (#14912)
JIT Compiler, Faster Distributed, C++ Frontend
Table of Contents
- Highlights
- JIT
- Brand New Distributed Package
- C++ Frontend [API Unstable]
- Torch Hub
- Breaking Changes
- Additional New Features
- N-dimensional empty tensors
- New Operators
- New Distributions
- Sparse API Improvements
- Additions to existing Operators and Distributions
- Bug Fixes
- Serious
- Backwards Compatibility
- Correctness
- Error checking
- Miscellaneous
- Other Improvements
- Deprecations
- CPP Extensions
- Performance
- Documentation Improvements
Highlights
JIT
The JIT is a set of compiler tools for bridging the gap between research in PyTorch and production. It allows for the creation of models that can run without a dependency on the Python interpreter and which can be optimized more aggressively. Using program annotations, existing models can be transformed into Torch Script, a subset of Python that PyTorch can run directly. Model code is still valid Python code and can be debugged with the standard Python toolchain. PyTorch 1.0 provides two ways in which you can make your existing code compatible with the JIT, using torch.jit.trace or torch.jit.script. Once annotated, Torch Script code can be aggressively optimized and it can be serialized for later use in our new C++ API, which doesn't depend on Python at all.
# Write in Python, run anywhere!
@torch.jit.script
def RNN(x, h, W_h, U_h, b_h):
    y = []
    for t in range(x.size(0)):
        h = torch.tanh(x[t] @ W_h + h @ U_h + b_h)
        y += [h]
    return torch.stack(y), h
As an example, see a tutorial on deploying a seq2seq model, loading an exported model from C++, or browse the docs.
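For completeness, here is a sketch of calling the scripted RNN function from the example above. The tensor sizes are arbitrary; the weight shapes are chosen so that x[t] @ W_h and h @ U_h both produce (batch, n_hidden) tensors.

```python
import torch


@torch.jit.script
def RNN(x, h, W_h, U_h, b_h):
    y = []
    for t in range(x.size(0)):
        h = torch.tanh(x[t] @ W_h + h @ U_h + b_h)
        y += [h]
    return torch.stack(y), h


seq_len, batch, n_in, n_hidden = 5, 2, 3, 4  # illustrative sizes only
x = torch.randn(seq_len, batch, n_in)
h0 = torch.zeros(batch, n_hidden)
W_h = torch.randn(n_in, n_hidden)
U_h = torch.randn(n_hidden, n_hidden)
b_h = torch.zeros(n_hidden)

outputs, h_last = RNN(x, h0, W_h, U_h, b_h)
```

The loop runs once per time step, so outputs stacks one hidden state per step and the last of them equals the returned final state.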
Brand New Distributed Package
The torch.distributed package and torch.nn.parallel.DistributedDataParallel module are backed by a brand new re-designed distributed library. The main highlights of the new library are:
- New torch.distributed is performance driven and operates entirely asynchronously for all backends: Gloo, NCCL, and MPI.
- Significant Distributed Data Parallel performance improvements especially for hosts with slower networks such as ethernet-based hosts
- Adds async support for all distributed collective operations in the torch.distributed package.
- Adds the following CPU ops in the Gloo backend: send, recv, reduce, all_gather, gather, scatter
- Adds barrier op in the NCCL backend
- Adds new_group support for the NCCL backend
C++ Frontend [API Unstable].
The C++ frontend is a pure C++ interface to the PyTorch backend that follows the API and architecture of the established Python frontend. It is intended to enable research in high performance, low latency and bare metal C++ applications. It provides equivalents to torch.nn, torch.optim, torch.data and other components of the Python frontend. Here is a minimal side-by-side comparison of the two language frontends:
Python:
import torch

model = torch.nn.Linear(5, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
prediction = model.forward(torch.randn(3, 5))
loss = torch.nn.functional.mse_loss(prediction, torch.ones(3, 1))
loss.backward()
optimizer.step()

C++:
#include <torch/torch.h>

torch::nn::Linear model(5, 1);
torch::optim::SGD optimizer(model->parameters(), /*lr=*/0.1);
torch::Tensor prediction = model->forward(torch::randn({3, 5}));
auto loss = torch::mse_loss(prediction, torch::ones({3, 1}));
loss.backward();
optimizer.step();
We are releasing the C++ frontend marked as "API Unstable" as part of PyTorch 1.0. This means it is ready to be used for your research application, but still has some open construction sites that will stabilize over the next couple of releases. Some parts of the API may undergo breaking changes during this time.
See https://pytorch.org/cppdocs for detailed documentation on the greater PyTorch C++ API as well as the C++ frontend.
Torch Hub
Torch Hub is a pre-trained model repository designed to facilitate research reproducibility.
Torch Hub supports publishing pre-trained models (model definitions and pre-trained weights) to a github repository using a simple hubconf.py file; see hubconf for resnet models in pytorch/vision as an example. Once published, users can load the pre-trained models using the torch.hub.load API.
For more details, see the torch.hub documentation. Expect a more-detailed blog post introducing Torch Hub in the near future!
Breaking Changes
- Indexing a 0-dimensional tensor will now throw an error instead of warn. Use tensor.item() instead. (#11679).
- torch.legacy is removed. (#11823).
- torch.masked_copy_ is removed, use torch.masked_scatter_ instead. (#9817).
- Operations that result in 0 element tensors may return changed shapes.
  - Before: all 0 element tensors would collapse to shape (0,). For example, torch.nonzero is documented to return a tensor of shape (n,z), where n = number of nonzero elements and z = dimensions of the input, but would always return a Tensor of shape (0,) when no nonzero elements existed.
  - Now: Operations return their documented shape.
# Previously: all 0-element tensors are collapsed to shape (0,)
>>> torch.nonzero(torch.zeros(2, 3))
tensor([], dtype=torch.int64)
# Now, proper shape is returned
>>> torch.nonzero(torch.zeros(2, 3))
tensor([], size=(0, 2), dtype=torch.int64)
- Sparse tensor indices and values shape invariants are changed to be more consistent in the case of 0-element tensors. See link for more details. (#9279).
- torch.distributed: the TCP backend is removed, we recommend to use Gloo and MPI backends for CPU collectives and NCCL backend for GPU collectives.
- Some inter-type operations (e.g. *) between torch.Tensors and NumPy arrays will now favor dispatching to the torch variant. This may result in different return types. (#9651).
- Implicit numpy conversion no longer implicitly moves a tensor to CPU. Therefore, you may have to explicitly move a CUDA tensor to CPU (tensor.to('cpu')) before an implicit conversion. (#10553).
- torch.randint now defaults to using dtype torch.int64 rather than the default floating-point dtype. (#11040).
- torch.tensor function with a Tensor argument now returns a detached Tensor (i.e. a Tensor where grad_fn is None). This more closely aligns with the intent of the function, which is to return a Tensor with copied data and no history. (#11061, #11815).
- torch.nn.functional.multilabel_soft_margin_loss now returns Tensors of shape (N,) instead of (N, C) to match the behavior of torch.nn.MultiMarginLoss. In addition, it is more numerically stable. (#9965).
- The result type of a torch.float16 0-dimensional tensor and an integer is now torch.float16 (was torch.float32 or torch.float64 depending on the dtype of the integer). (#11941).
- Dirichlet and Categorical distributions no longer accept scalar parameters. (#11589).
- CPP Extensions: Deprecated factory functions that accept a type a...
torch.jit, C++ API, c10d distributed
This is a pre-release preview, do not rely on the tag to have a fixed set of commits, or rely on the tag for anything practical / important
Table of Contents
- Highlights
- Breaking Changes
- Bug Fixes
- Other Improvements
- Deprecations
- Performance
- Documentation Improvements
Highlights
JIT
The JIT is a set of compiler tools for bridging the gap between research in PyTorch
and production. It includes a language called Torch Script (don't worry it is a subset of Python,
so you'll still be writing Python), and two ways in which you can make your existing code compatible with the JIT.
Torch Script code can be aggressively optimized and it can be serialized for later use in our new C++ API, which doesn't depend on Python at all.
# Write in Python, run anywhere!
@torch.jit.script
def RNN(x, h, W_h, U_h, b_h):
    y = []
    for t in range(x.size(0)):
        h = torch.tanh(x[t] @ W_h + h @ U_h + b_h)
        y += [h]
    return torch.stack(y), h
As an example, see a tutorial on deploying a seq2seq model, loading an exported model from C++, or browse the docs.
torch.distributed new "C10D" library
The torch.distributed package and torch.nn.parallel.DistributedDataParallel module are backed by the new "C10D" library. The main highlights of the new library are:
- C10D is performance driven and operates entirely asynchronously for all backends: Gloo, NCCL, and MPI.
- Significant Distributed Data Parallel performance improvements especially for hosts with slower networks such as ethernet-based hosts
- Adds async support for all distributed collective operations in the torch.distributed package.
- Adds send and recv support in the Gloo backend
C++ Frontend [API Unstable].
The C++ frontend is a pure C++ interface to the PyTorch backend that follows the API and architecture of the established Python frontend. It is intended to enable research in high performance, low latency and bare metal C++ applications. It provides equivalents to torch.nn, torch.optim, torch.data and other components of the Python frontend. Here is a minimal side-by-side comparison of the two language frontends:
Python:
import torch

model = torch.nn.Linear(5, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
prediction = model.forward(torch.randn(3, 5))
loss = torch.nn.functional.mse_loss(prediction, torch.ones(3, 1))
loss.backward()
optimizer.step()

C++:
#include <torch/torch.h>

torch::nn::Linear model(5, 1);
torch::optim::SGD optimizer(model->parameters(), /*lr=*/0.1);
torch::Tensor prediction = model->forward(torch::randn({3, 5}));
auto loss = torch::mse_loss(prediction, torch::ones({3, 1}));
loss.backward();
optimizer.step();
We are releasing the C++ frontend marked as "API Unstable" as part of PyTorch 1.0. This means it is ready to be used for your research application, but still has some open construction sites that will stabilize over the next month or two. Some parts of the API may undergo breaking changes during this time.
See https://pytorch.org/cppdocs for detailed documentation on the greater PyTorch C++ API as well as the C++ frontend.
Breaking Changes
- Indexing a 0-dimensional tensor will now throw an error instead of warn. Use tensor.item() instead. (#11679).
- torch.legacy is removed. (#11823).
- torch.masked_copy_ is removed, use torch.masked_scatter_ instead. (#9817).
- Operations that result in 0 element tensors may return changed shapes.
  - Before: all 0 element tensors would collapse to shape (0,). For example, torch.nonzero is documented to return a tensor of shape (n,z), where n = number of nonzero elements and z = dimensions of the input, but would always return a Tensor of shape (0,) when no nonzero elements existed.
  - Now: Operations return their documented shape.
# Previously: all 0-element tensors are collapsed to shape (0,)
>>> torch.nonzero(torch.zeros(2, 3))
tensor([], dtype=torch.int64)
# Now, proper shape is returned
>>> torch.nonzero(torch.zeros(2, 3))
tensor([], size=(0, 2), dtype=torch.int64)
- Sparse tensor indices and values shape invariants are changed to be more consistent in the case of 0-element tensors. See link for more details. (#9279).
- torch.distributed: the TCP backend is removed, we recommend to use Gloo and MPI backends for CPU collectives and NCCL backend for GPU collectives.
- Some inter-type operations (e.g. *) between torch.Tensors and NumPy arrays will now favor dispatching to the torch variant. This may result in different return types. (#9651).
- Implicit numpy conversion no longer implicitly moves a tensor to CPU. Therefore, you may have to explicitly move a CUDA tensor to CPU (tensor.to('cpu')) before an implicit conversion. (#10553).
- torch.randint now defaults to using dtype torch.int64 rather than the default floating-point dtype. (#11040).
- torch.tensor function with a Tensor argument now returns a detached Tensor (i.e. a Tensor where grad_fn is None). This more closely aligns with the intent of the function, which is to return a Tensor with copied data and no history. (#11061, #11815).
- torch.nn.functional.multilabel_soft_margin_loss now returns Tensors of shape (N,) instead of (N, C) to match the behavior of torch.nn.MultiMarginLoss. In addition, it is more numerically stable. (#9965).
- The result type of a torch.float16 0-dimensional tensor and an integer is now torch.float16 (was torch.float32 or torch.float64 depending on the dtype of the integer). (#11941).
- Dirichlet and Categorical distributions no longer accept scalar parameters. (#11589).
- CPP Extensions: Deprecated factory functions that accept a type as the first argument and a size as a second argument have been removed. Instead, use the new-style factory functions that accept the size as the first argument and TensorOptions as the last argument. For example, replace your call to at::ones(torch::CPU(at::kFloat), {2, 3}) with torch::ones({2, 3}, at::kCPU). This applies to the following functions: arange, empty, eye, full, linspace, logspace, ones, rand, randint, randn, randperm, range, zeros.
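Several of the Python-visible changes in the list above can be checked with a few lines. This is only a sketch with made-up values; note that newer releases may emit a warning for torch.tensor(existing_tensor) and suggest clone().detach() instead.

```python
import torch

# torch.tensor(existing_tensor) copies the data and drops autograd history.
a = torch.ones(3, requires_grad=True)
b = a * 2                 # b has a grad_fn
c = torch.tensor(b)       # c is detached; newer versions may warn here

# torch.randint defaults to int64.
r = torch.randint(0, 10, (4,))

# Zero-element results keep their documented shape.
nz = torch.nonzero(torch.zeros(2, 3))

# A float16 0-dim tensor combined with a Python integer stays float16.
half = torch.tensor(1.5, dtype=torch.float16) * 2
```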
Additional New Features
N-dimensional empty tensors
- Tensors with 0 elements can now have an arbitrary number of dimensions and support indexing and other torch operations; previously, 0 element tensors were limited to shape (0,). (#9947). Example:
>>> torch.empty((0, 2, 4, 0), dtype=torch.float64)
tensor([], size=(0, 2, 4, 0), dtype=torch.float64)
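As a small sketch of what "support indexing and other torch operations" can look like, the snippet below indexes, reduces and concatenates a zero-element tensor. The shapes are arbitrary.

```python
import torch

empty = torch.empty((0, 3))

column = empty[:, 1]           # indexing keeps the zero-length dimension
col_sums = empty.sum(dim=0)    # reducing over the empty dimension gives zeros
stacked = torch.cat([empty, torch.ones(2, 3)])  # concatenates like any other tensor
```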
New Operators
- [torch.argsort](https://pytor...
Spectral Norm, Adaptive Softmax, faster CPU ops, anomaly detection (NaNs, etc.), Lots of bug fixes, Python 3.7 and CUDA 9.2 support
Table of Contents
- Breaking Changes
- New Features
  - Neural Networks
    - Adaptive Softmax, Spectral Norm, etc.
  - Operators
    - torch.bincount, torch.as_tensor, ...
  - torch.distributions
    - Half Cauchy, Gamma Sampling, ...
  - Other
    - Automatic anomaly detection (detecting NaNs, etc.)
- Performance
  - Faster CPU ops in a wide variety of cases
- Other improvements
- Bug Fixes
- Documentation Improvements
Breaking Changes
- torch.stft has changed its signature to be consistent with librosa #9497
  - Before: stft(signal, frame_length, hop, fft_size=None, normalized=False, onesided=True, window=None, pad_end=0)
  - After: stft(input, n_fft, hop_length=None, win_length=None, window=None, center=True, pad_mode='reflect', normalized=False, onesided=True)
  - torch.stft is also now using FFT internally and is much faster.
- torch.slice is removed in favor of the tensor slicing notation #7924
- torch.arange now does dtype inference: any floating-point argument is inferred to be the default dtype; all integer arguments are inferred to be int64. #7016
- torch.nn.functional.embedding_bag's old signature embedding_bag(weight, input, ...) is deprecated; embedding_bag(input, weight, ...) (consistent with torch.nn.functional.embedding) should be used instead
- torch.nn.functional.sigmoid and torch.nn.functional.tanh are deprecated in favor of torch.sigmoid and torch.tanh #8748
- Broadcast behavior changed in a (very rare) edge case: [1] x [0] now broadcasts to [0] (used to be [1]) #9209
New Features
Neural Networks
- Adaptive Softmax nn.AdaptiveLogSoftmaxWithLoss #5287
>>> in_features = 1000
>>> n_classes = 200
>>> adaptive_softmax = nn.AdaptiveLogSoftmaxWithLoss(in_features, n_classes, cutoffs=[20, 100, 150])
>>> adaptive_softmax
AdaptiveLogSoftmaxWithLoss(
  (head): Linear(in_features=1000, out_features=23, bias=False)
  (tail): ModuleList(
    (0): Sequential(
      (0): Linear(in_features=1000, out_features=250, bias=False)
      (1): Linear(in_features=250, out_features=80, bias=False)
    )
    (1): Sequential(
      (0): Linear(in_features=1000, out_features=62, bias=False)
      (1): Linear(in_features=62, out_features=50, bias=False)
    )
    (2): Sequential(
      (0): Linear(in_features=1000, out_features=15, bias=False)
      (1): Linear(in_features=15, out_features=50, bias=False)
    )
  )
)
>>> batch = 15
>>> input = torch.randn(batch, in_features)
>>> target = torch.randint(n_classes, (batch,), dtype=torch.long)
>>> # get the log probabilities of target given input, and mean negative log probability loss
>>> adaptive_softmax(input, target)
ASMoutput(output=tensor([-6.8270, -7.9465, -7.3479, -6.8511, -7.5613, -7.1154, -2.9478, -6.9885,
        -7.7484, -7.9102, -7.1660, -8.2843, -7.7903, -8.4459, -7.2371],
       grad_fn=<ThAddBackward>), loss=tensor(7.2112, grad_fn=<MeanBackward1>))
>>> # get the log probabilities of all targets given input as a (batch x n_classes) tensor
>>> adaptive_softmax.log_prob(input)
tensor([[-2.6533, -3.3957, -2.7069, ..., -6.4749, -5.8867, -6.0611],
        [-3.4209, -3.2695, -2.9728, ..., -7.6664, -7.5946, -7.9606],
        [-3.6789, -3.6317, -3.2098, ..., -7.3722, -6.9006, -7.4314],
        ...,
        [-3.3150, -4.0957, -3.4335, ..., -7.9572, -8.4603, -8.2080],
        [-3.8726, -3.7905, -4.3262, ..., -8.0031, -7.8754, -8.7971],
        [-3.6082, -3.1969, -3.2719, ..., -6.9769, -6.3158, -7.0805]],
       grad_fn=<CopySlices>)
>>> # predict: get the class that maximizes log probability for each input
>>> adaptive_softmax.predict(input)
tensor([ 8, 6, 6, 16, 14, 16, 16, 9, 4, 7, 5, 7, 8, 14, 3])
- Add spectral normalization nn.utils.spectral_norm #6929
>>> # Usage is similar to weight_norm
>>> convT = nn.ConvTranspose2d(3, 64, kernel_size=3, padding=1)
>>> # Can specify number of power iterations applied each time, or use default (1)
>>> convT = nn.utils.spectral_norm(convT, n_power_iterations=2)
>>>
>>> # apply to every conv and conv transpose module in a model
>>> def add_sn(m):
...     for name, c in m.named_children():
...         m.add_module(name, add_sn(c))
...     if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
...         return nn.utils.spectral_norm(m)
...     else:
...         return m
>>> my_model = add_sn(my_model)
- nn.ModuleDict and nn.ParameterDict containers #8463
- Add nn.init.zeros_ and nn.init.ones_ #7488
- Add sparse gradient option to pretrained embedding #7492
- Add max pooling support to nn.EmbeddingBag #5725
- Depthwise convolution support for MKLDNN #8782
- Add nn.FeatureAlphaDropout (featurewise Alpha Dropout layer) #9073
Operators
- torch.bincount (count frequency of each value in an integral tensor) #6688
>>> input = torch.randint(0, 8, (5,), dtype=torch.int64)
>>> weights = torch.linspace(0, 1, steps=5)
>>> input, weights
(tensor([4, 3, 6, 3, 4]), tensor([ 0.0000, 0.2500, 0.5000, 0.7500, 1.0000]))
>>> torch.bincount(input)
tensor([0, 0, 0, 2, 2, 0, 1])
>>> input.bincount(weights)
tensor([0.0000, 0.0000, 0.0000, 1.0000, 1.0000, 0.0000, 0.5000])
- torch.as_tensor (similar to torch.tensor but never copies unless necessary) #7109
>>> tensor = torch.randn(3, device='cpu', dtype=torch.float32)
>>> torch.as_tensor(tensor)                       # doesn't copy
>>> torch.as_tensor(tensor, dtype=torch.float64)  # copies due to incompatible dtype
>>> torch.as_tensor(tensor, device='cuda')        # copies due to incompatible device
>>> array = np.array([3, 4.5])
>>> torch.as_tensor(array)                        # doesn't copy, sharing memory with the numpy array
>>> torch.as_tensor(array, device='cuda')         # copies due to incompatible device
- torch.randperm for CUDA tensors #7606
- nn.HardShrink for CUDA tensors #8117
- torch.flip (flips a tensor along specified dims) #7873
- torch.flatten (flattens a contiguous range of dims) #8578
- torch.pinverse (computes svd-based pseudo-inverse) #9052
- torch.unique for CUDA tensors #8899
- torch.erfc (complementary error function) https://github.com/pytorch/pytorch/pull/9366/files
- Support backward for target tensor in torch.nn.functional.kl_div #7839
- Add batched linear solver to torch.gesv #6100
- torch.sum now supports summing over multiple dimensions https://github.com/pytorch/pytorch/pull/6152/files
- torch.diagonal
[torch.diagflat
](https:...
Trade-off memory for compute, Windows support, 24 distributions with cdf, variance etc., dtypes, zero-dimensional Tensors, Tensor-Variable merge, faster distributed, perf and bug fixes, CuDNN 7.1
PyTorch 0.4.0 release notes
Table of Contents
- Major Core Changes
- Tensor / Variable merged
- Zero-dimensional Tensors
- dtypes
- migration guide
- New Features
- Tensors
- Full support for advanced indexing
- Fast Fourier Transforms
- Neural Networks
- Trade-off memory for compute
- bottleneck - a tool to identify hotspots in your code
- torch.distributions
- 24 basic probability distributions
- Added cdf, variance, entropy, perplexity etc.
- Distributed Training
- Launcher utility for ease of use
- NCCL2 backend
- C++ Extensions
- Windows Support
- ONNX Improvements
- RNN support
- Tensors
- Performance improvements
- Bug fixes
Major Core changes
Here is a summary of the updates to the most important core features users will use daily.
Major Changes and Potentially Breaking Changes:
- Tensors and Variables have merged
- Some operations now return 0-dimensional (scalar) Tensors
- Deprecation of the volatile flag
Improvements:
- dtypes, devices, and Numpy-style Tensor creation functions added
- Support for writing device-agnostic code
We wrote a migration guide that should help you transition your code to new APIs and style. Please read it if you have code in a previous version of PyTorch that you would like to migrate.
The contents of this section (Major Core changes) are included in the migration guide.
Merging Tensor
and Variable
classes
torch.autograd.Variable
and torch.Tensor
are now the same class. More precisely, torch.Tensor
is capable of tracking history and behaves like the old Variable
; Variable
wrapping continues to work as before but returns an object of type torch.Tensor
. This means that you don't need the Variable
wrapper everywhere in your code anymore.
The type()
of a Tensor
has changed
Note also that the type()
of a Tensor no longer reflects the data type. Use isinstance()
or x.type()
instead:
>>> x = torch.DoubleTensor([1, 1, 1])
>>> print(type(x)) # was torch.DoubleTensor
<class 'torch.Tensor'>
>>> print(x.type()) # OK: 'torch.DoubleTensor'
'torch.DoubleTensor'
>>> print(isinstance(x, torch.DoubleTensor)) # OK: True
True
When does autograd
start tracking history now?
requires_grad
, the central flag for autograd
, is now an attribute on Tensor
s. Let's see how this change manifests in code.
autograd
uses the same rules previously used for Variable
s. It starts tracking history when any input Tensor
of an operation has requires_grad=True
. For example,
>>> x = torch.ones(1) # create a tensor with requires_grad=False (default)
>>> x.requires_grad
False
>>> y = torch.ones(1) # another tensor with requires_grad=False
>>> z = x + y
>>> # both inputs have requires_grad=False. so does the output
>>> z.requires_grad
False
>>> # then autograd won't track this computation. let's verify!
>>> z.backward()
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
>>>
>>> # now create a tensor with requires_grad=True
>>> w = torch.ones(1, requires_grad=True)
>>> w.requires_grad
True
>>> # add to the previous result that has requires_grad=False
>>> total = w + z
>>> # the total sum now requires grad!
>>> total.requires_grad
True
>>> # autograd can compute the gradients as well
>>> total.backward()
>>> w.grad
tensor([ 1.])
>>> # and no computation is wasted to compute gradients for x, y and z, which don't require grad
>>> z.grad == x.grad == y.grad == None
True
Manipulating requires_grad
flag
Other than directly setting the attribute, you can change this flag in-place using my_tensor.requires_grad_(requires_grad=True)
, or, as in the above example, at creation time by passing it in as an argument (default is False
), e.g.,
>>> existing_tensor.requires_grad_()
>>> existing_tensor.requires_grad
True
>>> my_tensor = torch.zeros(3, 4, requires_grad=True)
>>> my_tensor.requires_grad
True
What about .data
?
.data
was the primary way to get the underlying Tensor
from a Variable
. After this merge, calling y = x.data
still has similar semantics. So y
will be a Tensor
that shares the same data with x
, is unrelated to the computation history of x
, and has requires_grad=False
.
However, .data
can be unsafe in some cases. Any changes on x.data
wouldn't be tracked by autograd
, and the computed gradients would be incorrect if x
is needed in a backward pass. A safer alternative is to use x.detach()
, which also returns a Tensor
that shares data with requires_grad=False
, but will have its in-place changes reported by autograd
if x
is needed in backward.
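The difference between the two can be seen in a small sketch. The tensor names are illustrative, the behavior shown is what a recent PyTorch build is expected to do, and the exact error text varies between versions:

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Case 1: modify the result through .detach(). The detached tensor shares
# its version counter with `out`, so autograd notices the in-place change.
out = a.sigmoid()            # sigmoid's backward pass needs the value of `out`
c = out.detach()
c.zero_()
try:
    out.sum().backward()
    detach_error = None
except RuntimeError as err:  # "... modified by an inplace operation"
    detach_error = err

# Case 2: the same change through .data is not tracked, so backward runs
# without complaint and silently produces wrong (all-zero) gradients.
out = a.sigmoid()
d = out.data
d.zero_()
out.sum().backward()
silent_grad = a.grad.clone()
```

In the second case the true gradient, sigmoid(a) * (1 - sigmoid(a)), is nonzero, which is why .detach() is the safer choice.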
Some operations now return 0-dimensional (scalar) Tensors
Previously, indexing into a Tensor vector (1-dimensional tensor) gave a Python number, but indexing into a Variable vector gave (inconsistently!) a vector of size (1,)! Similar behavior existed with reduction functions, e.g. tensor.sum() would return a Python number, but variable.sum() would return a vector of size (1,).
Fortunately, this release introduces proper scalar (0-dimensional tensor) support in PyTorch! Scalars can be created using the new torch.tensor
function (which will be explained in more detail later; for now just think of it as the PyTorch equivalent of numpy.array
). Now you can do things like:
>>> torch.tensor(3.1416) # create a scalar directly
tensor(3.1416)
>>> torch.tensor(3.1416).size() # scalar is 0-dimensional
torch.Size([])
>>> torch.tensor([3]).size() # compare to a vector of size 1
torch.Size([1])
>>>
>>> vector = torch.arange(2, 6) # this is a vector
>>> vector
tensor([ 2., 3., 4., 5.])
>>> vector.size()
torch.Size([4])
>>> vector[3] # indexing into a vector gives a scalar
tensor(5.)
>>> vector[3].item() # .item() gives the value as a Python number
5.0
>>> sum = torch.tensor([2, 3]).sum()
>>> sum
tensor(5)
>>> sum.size()
torch.Size([])
Accumulating losses
Consider the widely used pattern total_loss += loss.data[0]
before 0.4.0. loss
was a Variable
wrapping a tensor of size (1,)
, but in 0.4.0 loss
is now a scalar and has 0
dimensions. Indexing into a scalar doesn't make sense (it gives a warning now, but will be a hard error in 0.5.0): use loss.item()
to get the Python number from a scalar.
Note that if you don't convert to a Python number when accumulating losses, you may find increased memory usage in your program. This is because the right-hand-side of the above expression used to be a Python float, while it is now a zero-dim Tensor. The total loss is thus accumulating Tensors and their gradient history, which may keep around large autograd graphs for much longer than necessary.
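A minimal sketch of the recommended accumulation pattern follows. The model, data, and optimizer settings are placeholders chosen only to make the loop runnable:

```python
import torch

model = torch.nn.Linear(4, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

total_loss = 0.0
for _ in range(3):
    inputs = torch.randn(8, 4)    # placeholder batch
    targets = torch.randn(8, 1)
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)   # 0-dimensional Tensor
    loss.backward()
    optimizer.step()
    # .item() extracts a Python float, so no autograd history is retained
    total_loss += loss.item()
```

Writing total_loss += loss instead would make total_loss a Tensor that keeps each iteration's graph reachable.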
Deprecation of volatile
flag
The volatile
flag is now deprecated and has no effect. Previously, any computation that involved a Variable
with volatile=True
was not tracked by autograd
. This has now been replaced by a set of more flexible context managers including torch.no_grad()
, torch.set_grad_enabled(grad_mode)
, and others.
>>> x = torch.zeros(1, requires_grad=True)
>>> with torch.no_grad():
...     y = x * 2
>>> y.requires_grad
False
>>>
>>> is_train = False
>>> with torch.set_grad_enabled(is_train):
...     y = x * 2
>>> y.requires_grad
False
>>> torch.set_grad_enabled(True) # this can also be used as a function
>>> y = x * 2
>>> y.requires_grad
True
>>> torch.set_grad_enabled(False)
>>> y = x * 2
>>> y.requires_grad
False
dtypes
, devices
and NumPy-style creation functions
In previous versions of PyTorch, we used to specify data type (e.g. float vs double), device type (cpu vs cuda) and layout (dense vs sparse) together as a "tensor type". For example, torch.cuda.sparse.DoubleTensor
was the Tensor type representing the double data type, living on CUDA devices, and with COO sparse tensor layout.
In this release, we introduce torch.dtype
, [torch.device
](http://pyto...
Bug fixes and performance improvements
Binaries
- Removed support for CUDA capability 3.0 and 5.0 (they still work for source builds for now, but the commitment to support this forward is removed)
- Stop binary releases for CUDA 7.5
- Add CPU-only binary releases that are 10x smaller in size than the full binary with CUDA capabilities.
As always, links to our binaries are on http://pytorch.org
New features
- Add Cosine Annealing Learning Rate Scheduler #3311
- Add reduce argument to PoissonNLLLoss to be able to compute unreduced losses #3770
- Allow target.requires_grad=True in l1_loss and mse_loss (compute loss wrt target) #3876
- Add random_split that randomly splits a dataset into non-overlapping new datasets of given lengths #4435
- Introduced scopes to annotate ONNX graphs to have better TensorBoard visualization of models #5153
- Allow map_location in torch.load to be a string, such as map_location='cpu' or map_location='cuda:2' #4203
Bug Fixes
Data Loader / Datasets / Multiprocessing
- Made DataLoader workers more verbose on bus error and segfault. Additionally, add a timeout option to the DataLoader, which will error if sample loading time exceeds the given value. #3474
- DataLoader workers used to all have the same random number generator (RNG) seed because of the semantics of the fork syscall. Now, each worker will have its RNG seed set to base_seed + worker_id, where base_seed is a random int64 value generated by the parent process. You may use torch.initial_seed() to access this value in worker_init_fn, which can be used to set other seeds (e.g. NumPy) before data loading. worker_init_fn is an optional argument that will be called on each worker subprocess with the worker id as input, after seeding and before data loading #4018
- Add additional signal handling in DataLoader worker processes when workers abruptly die.
- Negative value for n_workers now gives a ValueError #4019
- Fixed a typo in the ConcatDataset.cumulative_sizes attribute name #3534
- Accept longs in default_collate for dataloader in python 2 #4001
- Re-initialize autograd engine in child processes #4158
- Fix distributed dataloader so it pins memory to current GPU not GPU 0. #4196
CUDA / CuDNN
- allow cudnn for fp16 batch norm #4021
- Use enabled argument in torch.autograd.profiler.emit_nvtx (was being ignored) #4032
- Fix cuBLAS arguments for fp16 torch.dot #3660
- Fix CUDA index_fill_ boundary check with small tensor size #3953
- Fix CUDA Multinomial checks #4009
- Fix CUDA version typo in warning #4175
- Initialize cuda before setting cuda tensor types as default #4788
- Add missing lazy_init in cuda python module #4907
- Lazy init order in set device, should not be called in getDevCount #4918
- Make torch.cuda.empty_cache() a no-op when cuda is not initialized #4936
CPU
- Assert MKL ld* conditions for ger, gemm, and gemv #4056
torch operators
- Fix tensor.repeat when the underlying storage is not owned by torch (for example, coming from numpy) #4084
- Add proper shape checking to torch.cat #4087
- Add check for slice shape match in index_copy_ and index_add_. #4342
- Fix use after free when advanced indexing tensors with tensors #4559
- Fix triu and tril for zero-strided inputs on gpu #4962
- Fix blas addmm (gemm) condition check #5048
- Fix topk work size computation #5053
- Fix reduction functions to respect the stride of the output #4995
- Improve float precision stability of linspace op, fix 4419. #4470
autograd
- Fix python gc race condition with THPVariable_traverse #4437
nn layers
- Fix padding_idx getting ignored in backward for Embedding(sparse=True) #3842
- Fix cosine_similarity's output shape #3811
- Add rnn args check #3925
- NLLLoss works for arbitrary dimensions #4654
- More strict shape check on Conv operators #4637
- Fix maxpool3d / avgpool3d crashes #5052
- Fix setting using running stats in InstanceNorm*d #4444
Multi-GPU
- Fix DataParallel scattering for empty lists / dicts / tuples #3769
- Fix refcycles in DataParallel scatter and gather (fix elevated memory usage) #4988
- Broadcast output requires_grad only if corresponding input requires_grad #5061
core
- Remove hard file offset reset in load() #3695
- Have sizeof account for size of stored elements #3821
- Fix undefined FileNotFoundError #4384
- make torch.set_num_threads also set MKL threads (take 2) #5002
others
- Fix wrong learning rate evaluation in CosineAnnealingLR in Python 2 #4656
Performance improvements
- slightly simplified math in IndexToOffset #4040
- improve performance of maxpooling backwards #4106
- Add cublas batched gemm support. #4151
- Rearrange dimensions for pointwise operations for better performance. #4174
- Improve memory access patterns for index operations. #4493
- Improve CUDA softmax performance #4973
- Fixed double memory accesses of several pointwise operations. #5068
Documentation and UX Improvements
- Better error messages for blas ops with cuda.LongTensor #4160
- Add missing trtrs, orgqr, ormqr docs #3720
- change doc for Adaptive Pooling #3746
- Fix MultiLabelMarginLoss docs #3836
- More docs for Conv1d Conv2d #3870
- Improve Tensor.scatter_ doc #3937
- [docs] rnn.py: Note zero defaults for hidden state/cell #3951
- Improve Tensor.new doc #3954
- Improve docs for torch and torch.Tensor #3969
- Added explicit tuple dimensions to doc for Conv1d. #4136
- Improve svd doc #4155
- Correct instancenorm input size #4171
- Fix StepLR example docs #4478
Performance improvements, new layers, ship models to other frameworks (via ONNX), CUDA9, CuDNNv7, lots of bug fixes
Table of contents
- Breaking changes: removed
reinforce()
- New features
- Unreduced losses
- A profiler for the autograd engine
- More functions support Higher order gradients
- New features in Optimizers
- New layers and nn functionality
- New Tensor functions and Features
- Other additions
- API changes
- Performance improvements
- Big reduction in framework overhead (helps small models)
- 4x to 256x faster Softmax/LogSoftmax
- More...
- Framework Interoperability
- DLPack Interoperability
- Model Exporter to ONNX (ship PyTorch to Caffe2, CoreML, CNTK, MXNet, Tensorflow)
- Bug Fixes (a lot of them)
Breaking changes
Stochastic functions, i.e. Variable.reinforce()
were removed because of their limited functionality and broad performance implications. The motivation for stochastic functions was to avoid book-keeping of sampled values. In practice, users were still book-keeping in their code for various reasons. We constructed an alternative, equally effective API, but did not have a reasonable deprecation path to the new API. Hence this removal is a breaking change.
We introduce the torch.distributions package to replace Stochastic functions.
Your previous code typically looked like this:
probs = policy_network(state)
action = probs.multinomial()
next_state, reward = env.step(action)
action.reinforce(reward)
action.backward()
This is the new equivalent code:
probs = policy_network(state)
# NOTE: categorical is equivalent to what used to be called multinomial
m = torch.distributions.Categorical(probs)
action = m.sample()
next_state, reward = env.step(action)
loss = -m.log_prob(action) * reward
loss.backward()
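For a version that runs without an environment, here is a hedged sketch of the same pattern. The two-action policy network, the random state, and the constant reward are stand-ins invented for illustration; only the torch.distributions.Categorical usage comes from the notes above:

```python
import torch

torch.manual_seed(0)

# Hypothetical policy: 4 state features in, probabilities over 2 actions out.
policy_network = torch.nn.Sequential(
    torch.nn.Linear(4, 2),
    torch.nn.Softmax(dim=-1),
)
state = torch.randn(4)

probs = policy_network(state)
m = torch.distributions.Categorical(probs)
action = m.sample()                   # integer action index, no gradient
reward = 1.0                          # stand-in for env.step(action)
loss = -m.log_prob(action) * reward   # score-function (REINFORCE) surrogate
loss.backward()                       # gradients reach the policy weights
```

The sampled action itself carries no gradient; the gradient flows through log_prob, which is why the book-keeping of sampled values now lives in user code.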
New features
Unreduced losses
Now, some loss functions can compute per-sample losses in a mini-batch
- By default PyTorch sums losses over the mini-batch and returns a single scalar loss. This was limiting to users.
- Now, a subset of loss functions allow specifying reduce=False to return individual losses for each sample in the mini-batch
- Example: loss = nn.CrossEntropyLoss(..., reduce=False)
- Currently supported losses: MSELoss, NLLLoss, NLLLoss2d, KLDivLoss, CrossEntropyLoss, SmoothL1Loss, L1Loss
- More loss functions will be covered in the next release
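As a rough illustration of per-sample losses, the sketch below assumes a later PyTorch release, where the same option is spelled reduction="none" rather than reduce=False. The logits and targets are made-up values:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 0.3, 3.0]])
targets = torch.tensor([0, 1])

# One loss value per sample in the mini-batch.
per_sample = F.cross_entropy(logits, targets, reduction="none")

# Default behavior: a single scalar (a mean over the batch in recent releases).
reduced = F.cross_entropy(logits, targets)
```

Per-sample values are handy for weighting or filtering individual examples before reducing them yourself.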
An in-built Profiler in the autograd engine
We built a low-level profiler to help you identify bottlenecks in your models
Let us start with an example:
>>> x = Variable(torch.randn(1, 1), requires_grad=True)
>>> with torch.autograd.profiler.profile() as prof:
...     y = x ** 2
...     y.backward()
>>> # NOTE: some columns were removed for brevity
... print(prof)
-------------------------------- ---------- ---------
Name CPU time CUDA time
------------------------------- ---------- ---------
PowConstant 142.036us 0.000us
N5torch8autograd9GraphRootE 63.524us 0.000us
PowConstantBackward 184.228us 0.000us
MulConstant 50.288us 0.000us
PowConstant 28.439us 0.000us
Mul 20.154us 0.000us
N5torch8autograd14AccumulateGradE 13.790us 0.000us
N5torch8autograd5CloneE 4.088us 0.000us
The profiler works for both CPU and CUDA models.
For CUDA models, you have to run your python program with a special nvprof
prefix. For example:
nvprof --profile-from-start off -o trace_name.prof -- python <your arguments>
# in python
>>> with torch.cuda.profiler.profile():
...     model(x) # Warmup CUDA memory allocator and profiler
...     with torch.autograd.profiler.emit_nvtx():
...         model(x)
Then, you can load trace_name.prof
in PyTorch and print a summary profile report.
>>> prof = torch.autograd.profiler.load_nvprof('trace_name.prof')
>>> print(prof)
Read additional documentation here
Higher order gradients
Added higher-order gradients support for the following layers
- ConvTranspose, AvgPool1d, AvgPool2d, LPPool2d, AvgPool3d, MaxPool1d, MaxPool2d, AdaptiveMaxPool, AdaptiveAvgPool, FractionalMaxPool2d, MaxUnpool1d, MaxUnpool2d, nn.Upsample, ReplicationPad2d, ReplicationPad3d, ReflectionPad2d
- PReLU, HardTanh, L1Loss, SoftSign, ELU, RReLU, Hardshrink, Softplus, SoftShrink, LogSigmoid, Softmin, GLU
- MSELoss, SmoothL1Loss, KLDivLoss, HingeEmbeddingLoss, SoftMarginLoss, MarginRankingLoss, CrossEntropyLoss
- DataParallel
Optimizers
- optim.SparseAdam: Implements a lazy version of Adam algorithm suitable for sparse tensors.
- In this variant, only moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters.
- Optimizers now have an add_param_group function that lets you add new parameter groups to an already constructed optimizer.
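A short sketch of add_param_group, for example when a new head is attached to a model that is already being optimized. The module names and learning rates are illustrative:

```python
import torch

backbone = torch.nn.Linear(10, 10)
head = torch.nn.Linear(10, 2)      # hypothetical module added later

optimizer = torch.optim.SGD(backbone.parameters(), lr=0.01, momentum=0.9)

# Register the new parameters with their own learning rate; options that are
# not given (here momentum) fall back to the optimizer's defaults.
optimizer.add_param_group({"params": head.parameters(), "lr": 0.1})
```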
New layers and nn functionality
- Added AdaptiveMaxPool3d and AdaptiveAvgPool3d
- Added LPPool1d
- F.pad now has support for:
- 'reflection' and 'replication' padding on 1d, 2d, 3d signals (so 3D, 4D and 5D Tensors)
- constant padding on n-d signals
- nn.Upsample now works for 1D signals (i.e. B x C x L Tensors) in nearest and linear modes.
- grid_sample now allows padding with the border value via padding_mode="border". grid_sample expects a grid in the range of [-1, 1], and if the values are out of these bounds, padding with the value 0.0 is applied by default. However, in a lot of cases, using the border value (i.e. the nearest valid value) helps improve accuracy of the overall model.
- Introducing nn.utils.parameters_to_vector and nn.utils.vector_to_parameters
  - parameters_to_vector takes net.parameters() and returns a 1D vector that contains all the parameters
  - vector_to_parameters takes a vector of flattened parameters and copies the values over to a network's parameters
  - Convenient for some reinforcement learning algorithms, such as cross-entropy method, TRPO etc., which need to pull all network parameters as one big vector, modify them, and put the modified vector back.
- Allow user to not specify certain input dimensions for AdaptivePool*d and infer them at runtime.
  - For example:
    # target output size of 10x7
    m = nn.AdaptiveMaxPool2d((None, 7))
- DataParallel container on CPU is now a no-op (instead of erroring out)
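The parameters_to_vector / vector_to_parameters round trip mentioned above might look like the following sketch. The tiny network and the random perturbation are illustrative only:

```python
import torch
from torch.nn.utils import parameters_to_vector, vector_to_parameters

net = torch.nn.Linear(3, 2)     # 3*2 weights + 2 biases = 8 parameters

# Pull every parameter into one flat 1D vector.
flat = parameters_to_vector(net.parameters())

# Modify the vector (here: a small random perturbation) ...
perturbed = flat.detach() + 0.01 * torch.randn_like(flat)

# ... and write the modified values back into the network.
vector_to_parameters(perturbed, net.parameters())
```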
New Tensor functions and features
- Introduced torch.erf and torch.erfinv that compute the error function and the inverse error function of each element in the Tensor.
- Adds broadcasting support to bitwise operators
- Added Tensor.put_ and torch.take similar to numpy.take and numpy.put.
  - The take function allows you to linearly index into a tensor without viewing it as a 1D tensor first. The output has the same shape as the indices.
  - The put function copies value into a tensor also using linear indices.
  - Differences from numpy equivalents:
    - numpy.take has an optional axis argument, which behaves like index_select. This axis argument is not yet present.
    - numpy.put repeats the values if necessary to make them as long as indices. This behavior is not yet replicated.
- Add zeros and zeros_like for sparse Tensors.
- 1-element Tensors can now be cast to Python scalars. For example: int(torch.Tensor([5])) works now.
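To make the linear-index behavior of torch.take and Tensor.put_ concrete, here is a small sketch. The values are arbitrary, and the tensor-construction calls assume a recent PyTorch release:

```python
import torch

x = torch.tensor([[10, 20, 30],
                  [40, 50, 60]])

# take: indices address x as if it were flattened to [10, 20, 30, 40, 50, 60];
# the result has the shape of the index tensor, not of x.
idx = torch.tensor([[0, 5],
                    [2, 3]])
taken = torch.take(x, idx)

# put_: write values in place at linear positions 1 and 4.
x.put_(torch.tensor([1, 4]), torch.tensor([-1, -2]))
```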
Other additions
- Added torch.cuda.get_device_name and torch.cuda.get_device_capability that do what the names say. Example:
>>> torch.cuda.get_device_name(0)
'Quadro GP100'
>>> torch.cuda.get_device_capability(0)
(6, 0)
- If one sets torch.backends.cudnn.deterministic = True, then the CuDNN convolutions use deterministic algorithms
- torch.cuda.get_rng_state_all and torch.cuda.set_rng_state_all are introduced to let you save / load the state of the random number generator over all GPUs at once
- torch.cuda.empty_cache() frees the cached memory blocks in PyTorch's caching allocator. This is useful when having long-running ipython notebooks while sharing the GPU with other processes.
API changes
- softmax and log_softmax now take a dim argument that specifies the dimension in which slices are taken for the softmax operation. dim allows negative dimensions as well (dim = -1 will be the last dimension)
- torch.potrf (Cholesky decomposition) is now differentiable and defined on Variable
- Remove all instances of device_id and replace it with device, to make things consistent
- torch.autograd.grad now allows you to specify inputs that are unused in the autograd graph if you use allow_unused=True. This gets useful when using torch.autograd.grad in large graphs with lists of inputs / outputs. For example:
  x, y = Variable(...), Variable(...)
  torch.autograd.grad(x * 2, [x, y]) # errors
  torch.autograd.grad(x * 2, [x, y], allow_unused=True) # works
- pad_packed_sequence now allows a padding_value argument that can be used instead of zero-padding
- Dataset now has a + operator (which uses ConcatDataset). You can do something like MNIST(...) + FashionMNIST(...) for example, and you will get a concatenated dataset containing samples from both.
- torch.distributed.recv allows Tensors to be received from any sender (hence, src is optional). recv
returns the...
Higher order gradients, Distributed PyTorch, Broadcasting, Advanced Indexing, New Layers and more
Here comes the next major release of PyTorch, just in time for ICML. Install it today from our website http://pytorch.org
Package documentation for this release is available at http://pytorch.org/docs/0.2.0/
We're introducing long-awaited features such as Broadcasting, Advanced Indexing, Higher-order gradients and finally: Distributed PyTorch.
Due to introducing Broadcasting, the code behavior for certain broadcastable situations is different from behavior in 0.1.12. This might lead to silent bugs in your existing code. We've provided easy ways of identifying this ambiguous code in the Important Breakages and Workarounds section.
Table of contents:
- Tensor Broadcasting (numpy-style)
- Advanced Indexing for Tensors and Variables
- Higher-order gradients
- Distributed PyTorch (multi-node training, etc.)
- Neural Network layers and features: SpatialTransformers, WeightNorm, EmbeddingBag, etc.
- New in torch and autograd: matmul, inverse, etc.
- Easier debugging, better error messages
- Bug Fixes
- Important Breakages and Workarounds
Tensor Broadcasting (numpy-style)
In short, if a PyTorch operation supports broadcasting, then its Tensor arguments can be automatically expanded to be of equal sizes (without making copies of the data).
PyTorch Broadcasting semantics closely follow numpy-style broadcasting; if you are familiar with numpy broadcasting, things should just work as expected.
General Semantics
Two tensors are “broadcastable” if the following rules hold:
- Each tensor has at least one dimension.
- When iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them is 1, or one of them does not exist.
For Example:
>>> x=torch.FloatTensor(5,7,3)
>>> y=torch.FloatTensor(5,7,3)
# same shapes are always broadcastable (i.e. the above rules always hold)
# can line up trailing dimensions
>>> x=torch.FloatTensor(5,3,4,1)
>>> y=torch.FloatTensor( 3,1,1)
# x and y are broadcastable.
# 1st trailing dimension: both have size 1
# 2nd trailing dimension: y has size 1
# 3rd trailing dimension: x size == y size
# 4th trailing dimension: y dimension doesn't exist
# but:
>>> x=torch.FloatTensor(5,2,4,1)
>>> y=torch.FloatTensor( 3,1,1)
# x and y are not broadcastable, because in the 3rd trailing dimension 2 != 3
If two tensors x, y are "broadcastable", the resulting tensor size is calculated as follows:
- If the number of dimensions of x and y are not equal, prepend 1 to the dimensions of the tensor with fewer dimensions to make them equal length.
- Then, for each dimension size, the resulting dimension size is the max of the sizes of x and y along that dimension.
For Example:
# can line up trailing dimensions to make reading easier
>>> x=torch.FloatTensor(5,1,4,1)
>>> y=torch.FloatTensor( 3,1,1)
>>> (x+y).size()
torch.Size([5, 3, 4, 1])
# error case
>>> x=torch.FloatTensor(5,2,4,1)
>>> y=torch.FloatTensor( 3,1,1)
>>> (x+y).size()
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 1
More details can be found on the PyTorch documentation site. Also, each torch function lists its broadcasting semantics in the documentation.
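The two rules above can be written out as a few lines of plain Python. This is only a sketch of the shape logic, not PyTorch's implementation, and broadcast_shape is a hypothetical helper name:

```python
from itertools import zip_longest


def broadcast_shape(x_shape, y_shape):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    result = []
    # Walk both shapes from the trailing dimension; a dimension that does not
    # exist in the shorter shape is treated as size 1 (the "prepend 1" rule).
    for a, b in zip_longest(reversed(x_shape), reversed(y_shape), fillvalue=1):
        if a != b and a != 1 and b != 1:
            raise ValueError(f"sizes {a} and {b} are not broadcastable")
        # Take the size that is not 1; for positive sizes this is the max.
        result.append(b if a == 1 else a)
    return tuple(reversed(result))


print(broadcast_shape((5, 1, 4, 1), (3, 1, 1)))
```

Calling it with (5, 2, 4, 1) and (3, 1, 1) raises, mirroring the error case shown earlier.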
Advanced Indexing for Tensors and Variables
PyTorch now supports a subset of NumPy-style advanced indexing. This allows users to select arbitrary indices at each dimension of the Tensor, including non-adjacent indices and duplicate indices, using the same []-style operation. This allows for a more flexible indexing strategy without needing calls to PyTorch's Index[Select, Add, ...] functions.
Let's look at some examples:
x = torch.Tensor(5, 5, 5)
Pure Integer Array Indexing - specify arbitrary indices at each dimension
x[[1, 2], [3, 2], [1, 0]]
--> yields a 2-element Tensor (x[1][3][1], x[2][2][0])
also supports broadcasting, duplicates
x[[2, 3, 2], [0], [1]]
--> yields a 3-element Tensor (x[2][0][1], x[3][0][1], x[2][0][1])
arbitrary indexer shapes allowed
x[[[1, 0], [0, 1]], [0], [1]]
--> yields a 2x2 Tensor [[x[1][0][1], x[0][0][1]],
[x[0][0][1], x[1][0][1]]]
can use colon, ellipsis
x[[0, 3], :, :]
x[[0, 3], ...]
--> both yield a 2x5x5 Tensor [x[0], x[3]]
also use Tensors to index!
y = torch.LongTensor([0, 2, 4])
x[y, :, :]
--> yields a 3x5x5 Tensor [x[0], x[2], x[4]]
selection with fewer indices than ndim, note the use of comma
x[[1, 3], ]
--> yields a 2x5x5 Tensor [x[1], x[3]]
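To make the element-selection rule concrete, here is a minimal pure-Python sketch over nested lists. It is not how PyTorch implements indexing, and it only covers flat index lists (no 2-D indexers, slices, or ellipsis); the name `integer_array_index` is hypothetical.

```python
def integer_array_index(x, *index_lists):
    """Emulate x[idx0, idx1, ...] for flat integer index lists on nested lists."""
    n = max(len(ix) for ix in index_lists)
    expanded = []
    for ix in index_lists:
        if len(ix) == n:
            expanded.append(list(ix))
        elif len(ix) == 1:
            # A length-1 index list is broadcast to the common length.
            expanded.append(list(ix) * n)
        else:
            raise IndexError("index lists cannot be broadcast together")
    out = []
    # The i-th result is x[idx0[i]][idx1[i]]...; duplicates are allowed.
    for coords in zip(*expanded):
        elem = x
        for c in coords:
            elem = elem[c]
        out.append(elem)
    return out


# A 5x5x5 nested list whose entries encode their own position: 100*i + 10*j + k.
x = [[[100 * i + 10 * j + k for k in range(5)] for j in range(5)] for i in range(5)]

pure = integer_array_index(x, [1, 2], [3, 2], [1, 0])    # x[1][3][1], x[2][2][0]
bcast = integer_array_index(x, [2, 3, 2], [0], [1])      # x[2][0][1], x[3][0][1], x[2][0][1]
partial = integer_array_index(x, [1, 3])                 # [x[1], x[3]], each 5x5
```

Because each entry encodes its coordinates, the results can be read off directly and compared against the tuples listed in the examples above.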
Higher order gradients
Now you can evaluate higher order differentials in PyTorch. For example, you can compute Hessian-Vector products, penalize the norm of the gradients of your model, implement Unrolled GANs and Improved WGANs, etc.
In the 0.2 release, we've enabled the ability to compute higher order gradients for all of the torch.XXX functions and the most popular nn layers. The rest will be covered in the next release.
Here's a short example that penalizes the norm of the weight gradients of a Resnet-18 model, so that the volume of weights is slow-changing.
import torch
from torchvision.models import resnet18
from torch.autograd import Variable
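model = resnet18().cuda()
# optimizer used for the update step at the end
# (SGD and the learning rate are arbitrary choices for this example)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)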
# dummy inputs for the example
input = Variable(torch.randn(2,3,224,224).cuda(), requires_grad=True)
target = Variable(torch.zeros(2).long().cuda())
# as usual
output = model(input)
loss = torch.nn.functional.nll_loss(output, target)
grad_params = torch.autograd.grad(loss, model.parameters(), create_graph=True)
# torch.autograd.grad does not accumulate the gradients into the .grad attributes
# It instead returns the gradients as Variable tuples.
# now compute the 2-norm of the grad_params
grad_norm = 0
for grad in grad_params:
    grad_norm += grad.pow(2).sum()
grad_norm = grad_norm.sqrt()
# take the gradients wrt grad_norm. backward() will accumulate
# the gradients into the .grad attributes
grad_norm.backward()
# do an optimization step
optimizer.step()
We see two new concepts here:
- torch.autograd.grad is a function that takes in [outputs, list of inputs (for which you want gradients)], and returns the gradients wrt. these inputs as a tuple, rather than accumulating the gradients into the .grad attributes. This is useful if you want to further operate on the gradients.
- You can operate on the gradients, and call backward() on them.
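Both concepts can be seen on a scalar function small enough to check by hand. The sketch below is not from the release itself and assumes a later PyTorch (0.4 or newer), where tensors carry requires_grad directly and no Variable wrapper is needed. It differentiates y = x^3 twice at x = 3.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 3

# First derivative dy/dx = 3 * x**2, returned as a tuple rather than
# accumulated into x.grad. create_graph=True records this computation so
# that the result can itself be differentiated.
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)

# Differentiate the first derivative. backward() accumulates
# d2y/dx2 = 6 * x into x.grad.
dy_dx.backward()

first = dy_dx.item()
second = x.grad.item()
```

Analytically, the first derivative at x = 3 is 27 and the second is 18, which is what `first` and `second` should hold.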
The nn layers that support higher order gradients are:
- AvgPool*d, BatchNorm*d, Conv*d, MaxPool1d,2d, Linear, Bilinear
- pad, ConstantPad2d, ZeroPad2d, LPPool2d, PixelShuffle
- ReLU6, LeakyReLU, PReLU, Tanh, Tanhshrink, Threshold, Sigmoid, HardTanh, ELU, Softsign, SeLU
- L1Loss, NLLLoss, PoissonNLLLoss, LogSoftmax, Softmax2d
The rest will be enabled in the next release.
To enable higher order gradients, we've introduced a new style of writing autograd.Function (the current/old style of writing functions is fully backward compatible). You can read more about the new style of functions here.
Most of you don't write your own autograd.Functions; they are low-level primitives that introduce new operations to the autograd engine, where you specify the forward and backward calls.
Distributed PyTorch
We introduce the torch.distributed package that allows you to exchange Tensors among multiple machines. Using this package, you can scale your network training over multiple machines and larger mini-batches. For example, you are given the primitives to implement Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
The distributed package follows an MPI-style programming model. This means that there are functions provided to you such as send, recv, all_reduce that will exchange Tensors among nodes (machines).
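For readers unfamiliar with MPI-style collectives, the following pure-Python simulation shows what a sum all_reduce leaves on each rank. It does not use torch.distributed and does no real communication; `simulate_all_reduce` and the sample values are hypothetical, chosen only to illustrate the semantics.

```python
def simulate_all_reduce(per_rank_values):
    """Simulate a sum all_reduce.

    per_rank_values[r] is the list of numbers held by rank r before the call.
    Returns what each rank holds afterwards: every rank gets the same
    element-wise sum over all ranks.
    """
    # Element-wise sum across ranks.
    total = [sum(column) for column in zip(*per_rank_values)]
    # Every rank receives its own copy of the reduced result.
    return [list(total) for _ in per_rank_values]


# Three ranks, each holding a 2-element "tensor".
before = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
after = simulate_all_reduce(before)
```

After the call, all three ranks hold the same values, which is why all_reduce is the usual building block for averaging gradients across machines.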
So that the machines can first identify each other and be assigned unique numbers (ranks), we provide simple initialization methods:
- shared file system (requires that all processes can access a single file system)
- IP multicast (requires that all processes are in the same network)
- environment variable (requires you to manually assign ranks and know an address of a node reachable from all processes)
Our package documentation contains more details on initialization and available backends, but here's an example of initializing using a multicast address:
import torch.distributed as dist
dist.init_process_group(backend='tcp',
                        init_method='tcp://[ff15:1e18:5d4c:4cf0:d02d:b659:53ba:b0a7]:23456',
                        world_size=4)
print('Hello from process {} (out of {})!'.format(
    dist.get_rank(), dist.get_world_size()))
This would print Hello from process 2 (out of 4)! on the 3rd machine.
World size is the number of processes that will participate in the job. Each will be assigned a rank, which is a number between 0 and world_size - 1, unique within this job. It will serve as a process identifier and will be used instead of an address to, for example, specify which process a tensor should be sent to.
Here's a snippet that shows how simple point-to-point communication can be performed:
# All proces...