Sparse support for CUDA, bug fixes, performance improvements
API Changes
- `torch.range` is deprecated in favor of `torch.arange`, which is consistent with numpy and Python's `range` (a short sketch follows this list).
- On sparse Tensors, `contiguous` is renamed to `coalesce`, and `coalesce` is now out-of-place.
  (A reminder that the sparse API is still experimental and evolving, so we don't provide backward compatibility.)
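For example, a minimal sketch of the difference (the boundary behavior of the deprecated `torch.range` is recalled from earlier releases):

```python
import torch

# torch.arange excludes the end point, matching numpy.arange and Python's range.
print(torch.arange(0, 5))   # 0, 1, 2, 3, 4
# The deprecated torch.range(0, 5) included the end point: 0, 1, 2, 3, 4, 5
```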
New Features
New layers and functions
- `torch.topk` is now supported for all CUDA types, not just `torch.cuda.FloatTensor`.
- Added a three-way ranking loss: `nn.TripletMarginLoss`
- Added per-instance normalization layers: `nn.InstanceNorm1d`, `nn.InstanceNorm2d`, `nn.InstanceNorm3d`.
  Each channel is treated as an instance to normalize, with mean subtraction and division by the standard deviation. This is useful when dealing with larger images and smaller mini-batches where BatchNorm-like effects are desired.
- `nn.ZeroPad2d` and `nn.ConstantPad2d` are added.
- `nn.Bilinear` is added, which computes `Y = X1 * W * X2 + b` (see the sketch after this list).
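A minimal sketch of the new layers (shapes are illustrative only; `Variable` wrappers match the API of this release):

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

# Per-instance normalization over a batch of images.
inorm = nn.InstanceNorm2d(3)
img = Variable(torch.randn(4, 3, 32, 32))
out = inorm(img)                     # same shape as img

# Bilinear layer: Y = X1 * W * X2 + b
bilinear = nn.Bilinear(20, 30, 40)
x1 = Variable(torch.randn(128, 20))
x2 = Variable(torch.randn(128, 30))
y = bilinear(x1, x2)                 # shape: (128, 40)
```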
Negative dimension support for all functions
Every function that takes a dimension argument now also accepts negative dimensions.
A negative dimension indexes the tensor from the last dimension.
For example:

```python
x = torch.randn(10, 20, 30)
y = torch.mean(x, dim=-1)
```

Here, since `x` has 3 dimensions and `dim=-1` refers to the last dimension, `dim=2` is picked for taking the mean.
The functions with dimension arguments are:

```
narrow, transpose, size, cat, chunk, gather, index_select, split, squeeze,
stack, unbind, unsqueeze, cumprod, cumsum, mean, median, mode, norm, prod, std,
sum, var, kthvalue, max, min, sort, topk, renorm,
index_add, index_copy, index_fill, scatter, select, unfold
```
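The same convention applies to every function in the list above; for instance, a quick sketch with `cat` and `topk` (tensor sizes are illustrative only):

```python
import torch

a = torch.randn(2, 3, 4)
b = torch.randn(2, 3, 6)
c = torch.cat([a, b], dim=-1)                          # concatenate along the last dim: (2, 3, 10)

values, indices = torch.randn(5, 7).topk(3, dim=-1)    # top-3 along the last dim
```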
CUDA support for Sparse Tensors, faster CPU sparse
A part of the `torch.sparse` API is now also supported for `torch.cuda.sparse.*Tensor`.

Functions that are supported on CUDA are:

```
sparse_mask, to_dense, coalesce, transpose, spaddmm,
spcadd, mul, div, cadd, csub, cmul
```
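A minimal sketch of constructing and using a sparse tensor (the index/value constructor shown here is an assumption based on the `torch.sparse` API of this release; the `.cuda()` move relies on the new CUDA sparse support):

```python
import torch

# A 2x3 sparse matrix with non-zero entries at (0, 2) and (1, 0).
i = torch.LongTensor([[0, 1],
                      [2, 0]])        # indices: ndim x nnz
v = torch.FloatTensor([3.0, 4.0])     # values
s = torch.sparse.FloatTensor(i, v, torch.Size([2, 3]))

s = s.coalesce()          # coalesce is now out-of-place
d = s.to_dense()          # dense copy

if torch.cuda.is_available():
    sc = s.cuda()         # becomes a torch.cuda.sparse.FloatTensor
    print(sc.to_dense())
```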
- `nn.Embedding` now supports sparse gradients even on CUDA (with the `sparse=True` flag), leveraging these sparse functions (a minimal sketch follows this list).
- A new hybrid matrix-multiply operation, `hspmm`, multiplies a sparse matrix with a dense matrix and returns the result in the form of a hybrid tensor (i.e. 1 sparse dimension, 1 dense dimension).
- Several of the CPU sparse functions now have more efficient implementations.
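A minimal sketch of the sparse embedding path (sizes are illustrative; add `.cuda()` calls to exercise the new CUDA support):

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

# sparse=True makes the embedding produce sparse gradients for its weight.
embed = nn.Embedding(10000, 128, sparse=True)
idx = Variable(torch.LongTensor([[1, 2, 4], [4, 3, 9]]))
out = embed(idx)          # shape: (2, 3, 128)
out.sum().backward()      # embed.weight.grad is a sparse tensor
```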
In a quickly hacked-up Embedding classifier training script by @martinraison, we see CUDA sparse performing as well as CUDA dense:
https://gist.github.com/martinraison/1e7c18c6f6eda87f1cb4995b0e6a22a5
Times below are in seconds per batch:

|        | CPU  | CUDA |
|--------|------|------|
| Dense  | 10   | 0.86 |
| Sparse | 0.15 | 0.13 |
named_parameters to filter out specific parameter types
Let's say that you want to add weight decay to all parameters of your model except for the biases. How do you get only the biases of your model?
We introduce `nn.Module.named_parameters` for this.
It joins `named_children` and `named_modules` in helping you filter specific attributes of models.

Example of filtering out the biases of a model and giving them a weight_decay of 0:
```python
import torch
import torch.nn as nn
import torch.optim as optim

m = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 20),
    nn.ReLU(),
)

weights, biases = [], []
for name, p in m.named_parameters():
    if 'bias' in name:
        biases += [p]
    else:
        weights += [p]

optim.SGD([
    {'params': weights},
    {'params': biases, 'weight_decay': 0}
], lr=1e-2, momentum=0.9, weight_decay=1e-5)
```
Performance Improvements
- `cumsum` and `cumprod` have been made significantly faster on the GPU by using Thrust primitives where appropriate.
- `LSTMCell` and `GRUCell` are now significantly faster on the GPU via a fused kernel.
- The default algorithm for cuDNN has been changed to `PRECOMP_GEMM`, which is a much faster algorithm that takes a tiny bit of workspace. Previously, it used to be `IMPLICIT_GEMM`, which took zero workspace but was significantly slower.
- 5% to 10% improvement in the data loader by collating batches directly into shared memory.
- SVD is now computed on the GPU via divide-and-conquer (sgesdd), which gives a 2x to 5x speedup.
- The commonly used function `expand` has been moved to C for better performance in smaller models (a quick usage sketch follows this list).
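For reference, a quick sketch of what `expand` does (the speedup is internal; usage is unchanged):

```python
import torch

# expand broadcasts singleton dimensions to larger sizes without copying data;
# the returned tensor is a view that shares storage with x.
x = torch.randn(1, 5)
y = x.expand(4, 5)    # shape: (4, 5)
```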
Bug Fixes
- Added contiguous checks on weight and bias for a large range of THNN functions
- Make the range of `random_` correct when both lower and upper bound are specified
- `parallel_apply` can now take arguments that are unhashable
- Reshape `grad` correctly in the Dot function (inputs don't have to be 1D vectors...)
- Added `Variable.type_as`
- Unify argument names of `norm` and `renorm` to have `p=norm_type, dim=dim`
- `btrisolve` works on CPU doubles
- IPython autocomplete for `torch.nn.Module` fixed via implementing `__dir__`
- `device_ids` can now be `None` again in `F.data_parallel` and will use all available GPUs
- Workaround cuDNN bugs in BatchNorm (<5.1.10) and Dilation (6.0.20)
- Padding bugfix in Conv1d CPU
- `remainder` and `cremainder` are fixed for integer types
- Fix memory leak in `btrisolve` and `getri`
- If an nn.Module's source can't be retrieved because of any exception, handle serialization to be non-fatal
- `collate_fn` now retains the type of the numpy array
- `is_tensor` and `is_storage` are now fixed for old-style Python classes
- `torch.cat` now supports keyword arguments
- CUDA collectives supported coalescing, but the inputs were all assumed to be of the same Tensor type. This is fixed.
- Fix a deadlock bug in autograd caused by an underlying glibc bug in specific Linux distros (ArchLinux in particular)
- `abs` is now fixed for `char` and `short` CUDA types
- Fix `torch.diag` autograd when given a dimension argument
- Fix grouped convolution on CPU when `bias=False`
- Expose `dilated` convolutions for `ConvTranspose*d`
- Fix a bug in `HingeEmbeddingLoss`: `margin` can now be specified via kwargs
Improved error messages
- Fix errors and messages when no CUDA devices are available.