Bug fixes and performance improvements
Binaries
- Removed support for CUDA capability 3.0 and 5.0 (they still work for source builds for now, but the commitment to support them going forward is removed)
- Stopped binary releases for CUDA 7.5
- Add CPU-only binary releases that are 10x smaller in size than the full binary with CUDA capabilities.
As always, links to our binaries are on http://pytorch.org
New features
- Add Cosine Annealing Learning Rate Scheduler #3311
- Add `reduce` argument to `PoissonNLLLoss` to be able to compute unreduced losses #3770
- Allow `target.requires_grad=True` in `l1_loss` and `mse_loss` (compute loss wrt `target`) #3876
- Add `random_split` that randomly splits a dataset into non-overlapping new datasets of given lengths #4435
- Introduced scopes to annotate ONNX graphs to have better TensorBoard visualization of models #5153
- Allow `map_location` in `torch.load` to be a string, such as `map_location='cpu'` or `map_location='cuda:2'` #4203
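Two of the additions above can be sketched together. This is a minimal illustration (not the exact code from the PRs), assuming a PyTorch build that includes `random_split` and string-valued `map_location`:

```python
import io

import torch
from torch.utils.data import TensorDataset, random_split

# random_split: partition a dataset into non-overlapping
# subsets of the given lengths.
dataset = TensorDataset(torch.arange(10.0))
train_set, val_set = random_split(dataset, [8, 2])

# map_location as a plain string: remap all storages to CPU
# on load, instead of passing a dict or a callable.
buffer = io.BytesIO()
torch.save(torch.arange(4.0), buffer)
buffer.seek(0)
loaded = torch.load(buffer, map_location='cpu')
```

The string form is a convenience over the older callable/dict forms of `map_location`; all three remain valid.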
Bug Fixes
Data Loader / Datasets / Multiprocessing
- Made DataLoader workers more verbose on bus error and segfault. Additionally, add a `timeout` option to the DataLoader, which will error if sample loading time exceeds the given value. #3474
- DataLoader workers used to all have the same random number generator (RNG) seed because of the semantics of the `fork` syscall. Now, each worker has its RNG seed set to `base_seed + worker_id`, where `base_seed` is a random int64 value generated by the parent process. You may use `torch.initial_seed()` to access this value in `worker_init_fn`, which can be used to set other seeds (e.g. NumPy) before data loading. `worker_init_fn` is an optional argument that will be called on each worker subprocess with the worker id as input, after seeding and before data loading. #4018
- Add additional signal handling in DataLoader worker processes when workers abruptly die.
- Negative value for n_workers now gives a ValueError #4019
- Fixed a typo in `ConcatDataset.cumulative_sizes` attribute name #3534
- Accept longs in `default_collate` for dataloader in Python 2 #4001
- Re-initialize autograd engine in child processes #4158
- Fix distributed dataloader so it pins memory to current GPU not GPU 0. #4196
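The per-worker seeding scheme described above lets `worker_init_fn` derive distinct seeds for other libraries. A minimal sketch (the dataset and worker count here are illustrative, not from the PR):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def worker_init_fn(worker_id):
    # Inside a worker, torch.initial_seed() returns base_seed + worker_id,
    # so each worker can derive its own NumPy seed from it.
    np.random.seed(torch.initial_seed() % 2**32)

dataset = TensorDataset(torch.arange(8.0))
loader = DataLoader(dataset, batch_size=2, num_workers=2,
                    worker_init_fn=worker_init_fn)
batches = list(loader)
```

Because `base_seed` differs on every run while `worker_id` is stable, workers no longer share identical RNG state after `fork`.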
CUDA / CuDNN
- allow cudnn for fp16 batch norm #4021
- Use `enabled` argument in `torch.autograd.profiler.emit_nvtx` (was being ignored) #4032
- Fix cuBLAS arguments for fp16 `torch.dot` #3660
- Fix CUDA `index_fill_` boundary check with small tensor size #3953
- Fix CUDA Multinomial checks #4009
- Fix CUDA version typo in warning #4175
- Initialize cuda before setting cuda tensor types as default #4788
- Add missing lazy_init in cuda python module #4907
- Fix lazy-init ordering in set device; lazy init should not be called in getDevCount #4918
- Make torch.cuda.empty_cache() a no-op when cuda is not initialized #4936
CPU
- Assert MKL ld* conditions for ger, gemm, and gemv #4056
torch operators
- Fix `tensor.repeat` when the underlying storage is not owned by `torch` (for example, coming from numpy) #4084
- Add proper shape checking to `torch.cat` #4087
- Add check for slice shape match in index_copy_ and index_add_. #4342
- Fix use after free when advanced indexing tensors with tensors #4559
- Fix triu and tril for zero-strided inputs on gpu #4962
- Fix blas addmm (gemm) condition check #5048
- Fix topk work size computation #5053
- Fix reduction functions to respect the stride of the output #4995
- Improve float precision stability of `linspace` op, fixes #4419. #4470
autograd
- Fix python gc race condition with THPVariable_traverse #4437
nn layers
- Fix padding_idx getting ignored in backward for Embedding(sparse=True) #3842
- Fix cosine_similarity's output shape #3811
- Add rnn args check #3925
- NLLLoss works for arbitrary dimensions #4654
- More strict shape check on Conv operators #4637
- Fix maxpool3d / avgpool3d crashes #5052
- Fix setting using running stats in InstanceNorm*d #4444
Multi-GPU
- Fix DataParallel scattering for empty lists / dicts / tuples #3769
- Fix refcycles in DataParallel scatter and gather (fix elevated memory usage) #4988
- Broadcast output requires_grad only if corresponding input requires_grad #5061
core
- Remove hard file offset reset in load() #3695
- Have sizeof account for size of stored elements #3821
- Fix undefined FileNotFoundError #4384
- make torch.set_num_threads also set MKL threads (take 2) #5002
others
- Fix wrong learning rate evaluation in CosineAnnealingLR in Python 2 #4656
Performance improvements
- slightly simplified math in IndexToOffset #4040
- improve performance of maxpooling backwards #4106
- Add cublas batched gemm support. #4151
- Rearrange dimensions for pointwise operations for better performance. #4174
- Improve memory access patterns for index operations. #4493
- Improve CUDA softmax performance #4973
- Fixed double memory accesses of several pointwise operations. #5068
Documentation and UX Improvements
- Better error messages for blas ops with cuda.LongTensor #4160
- Add missing trtrs, orgqr, ormqr docs #3720
- change doc for Adaptive Pooling #3746
- Fix MultiLabelMarginLoss docs #3836
- More docs for Conv1d Conv2d #3870
- Improve Tensor.scatter_ doc #3937
- [docs] rnn.py: Note zero defaults for hidden state/cell #3951
- Improve Tensor.new doc #3954
- Improve docs for torch and torch.Tensor #3969
- Added explicit tuple dimensions to doc for Conv1d. #4136
- Improve svd doc #4155
- Correct instancenorm input size #4171
- Fix StepLR example docs #4478