Bug fixes and performance improvements

@soumith released this on 14 Feb 00:36

Binaries

  • Removed support for CUDA capability 3.0 and 5.0 (they still work in source builds for now, but we no longer commit to supporting them going forward)
  • Stopped binary releases for CUDA 7.5
  • Added CPU-only binary releases that are 10x smaller in size than the full CUDA-enabled binaries.

As always, links to our binaries are on http://pytorch.org

New features

Bug Fixes

Data Loader / Datasets / Multiprocessing

  • Made DataLoader workers more verbose on bus error and segfault. Additionally, added a timeout option to the DataLoader, which raises an error if loading a sample takes longer than the given value. #3474
  • DataLoader workers used to all share the same random number generator (RNG) seed because of the semantics of the fork syscall. Now each worker has its RNG seed set to base_seed + worker_id, where base_seed is a random int64 value generated by the parent process. You can use torch.initial_seed() to access this value in worker_init_fn and use it to set other seeds (e.g. NumPy) before data loading. worker_init_fn is an optional argument that is called in each worker subprocess with the worker id as input, after seeding and before data loading; see the sketch after this list. #4018
  • Add additional signal handling in DataLoader worker processes when workers abruptly die.
  • A negative value for num_workers now raises a ValueError #4019
  • Fixed a typo in the ConcatDataset.cumulative_sizes attribute name #3534
  • Accept longs in default_collate for the DataLoader in Python 2 #4001
  • Re-initialize autograd engine in child processes #4158
  • Fix the distributed DataLoader so it pins memory to the current GPU rather than GPU 0. #4196
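
As a minimal sketch of the new per-worker seeding and the timeout option (the toy dataset and the choice to seed NumPy here are illustrative assumptions, not part of the change itself):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def worker_init_fn(worker_id):
    # Inside a worker, torch.initial_seed() already reflects base_seed + worker_id,
    # so it can be reused to seed other libraries such as NumPy.
    np.random.seed(torch.initial_seed() % 2 ** 32)

dataset = TensorDataset(torch.randn(100, 3), torch.zeros(100).long())
loader = DataLoader(dataset, batch_size=10, num_workers=2,
                    worker_init_fn=worker_init_fn,
                    timeout=30)  # error out if fetching a sample takes longer than 30s

for inputs, targets in loader:
    pass  # training step would go here
```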

CUDA / CuDNN

  • Allow cuDNN for fp16 batch norm #4021
  • Use the enabled argument in torch.autograd.profiler.emit_nvtx (it was previously ignored) #4032
  • Fix cuBLAS arguments for fp16 torch.dot #3660
  • Fix CUDA index_fill_ boundary check with small tensor size #3953
  • Fix CUDA Multinomial checks #4009
  • Fix CUDA version typo in warning #4175
  • Initialize cuda before setting cuda tensor types as default #4788
  • Add missing lazy_init in cuda python module #4907
  • Fix lazy init order in set device; lazy_init should not be called in getDevCount #4918
  • Make torch.cuda.empty_cache() a no-op when cuda is not initialized #4936
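
A tiny illustration of the empty_cache() change above:

```python
import torch

# empty_cache() is now a no-op when CUDA has not been initialized,
# so it is safe to call even if no CUDA tensors have been created
# yet in the process.
torch.cuda.empty_cache()
```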

CPU

  • Assert MKL ld* conditions for ger, gemm, and gemv #4056

torch operators

  • Fix tensor.repeat when the underlying storage is not owned by torch (for example, coming from NumPy) #4084
  • Add proper shape checking to torch.cat (see the example after this list) #4087
  • Add check for slice shape match in index_copy_ and index_add_. #4342
  • Fix use after free when advanced indexing tensors with tensors #4559
  • Fix triu and tril for zero-strided inputs on GPU #4962
  • Fix blas addmm (gemm) condition check #5048
  • Fix topk work size computation #5053
  • Fix reduction functions to respect the stride of the output #4995
  • Improve float precision stability of the linspace op; fixes #4419. #4470
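
A small sketch of the stricter torch.cat shape checking mentioned above:

```python
import torch

a = torch.randn(2, 3)
b = torch.randn(2, 4)

c = torch.cat([a, b], dim=1)   # fine: only the concatenation dimension differs, result is 2 x 7
# torch.cat([a, b], dim=0)     # now raises an error, since the non-cat
                               # dimensions (3 vs 4) do not match
```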

autograd

  • Fix python gc race condition with THPVariable_traverse #4437

nn layers

  • Fix padding_idx getting ignored in backward for Embedding(sparse=True) #3842
  • Fix cosine_similarity's output shape #3811
  • Add rnn args check #3925
  • NLLLoss now works for inputs of arbitrary dimensions (see the example after this list) #4654
  • More strict shape check on Conv operators #4637
  • Fix maxpool3d / avgpool3d crashes #5052
  • Fix the setting that controls use of running stats in InstanceNorm*d #4444
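
A sketch of NLLLoss on higher-dimensional input, as referenced above (written against a recent API; on 0.3.x the tensors would be wrapped in Variable):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Per-pixel classification: (N, C, H, W) log-probabilities with (N, H, W) targets.
log_probs = F.log_softmax(torch.randn(4, 10, 8, 8), dim=1)
target = torch.LongTensor(4, 8, 8).random_(0, 10)

loss = nn.NLLLoss()(log_probs, target)
```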

Multi-GPU

  • Fix DataParallel scattering for empty lists / dicts / tuples #3769
  • Fix refcycles in DataParallel scatter and gather (fixes elevated memory usage) #4988
  • Broadcast output requires_grad only if corresponding input requires_grad #5061

core

  • Remove hard file offset reset in load() #3695
  • Have sizeof account for size of stored elements #3821
  • Fix undefined FileNotFoundError #4384
  • Make torch.set_num_threads also set MKL threads (take 2) #5002
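
For illustration, assuming a CPU build that links against MKL:

```python
import torch

# set_num_threads now also configures the MKL thread pool,
# so a single call bounds CPU intra-op parallelism.
torch.set_num_threads(4)
print(torch.get_num_threads())
```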

others

  • Fix wrong learning rate evaluation in CosineAnnealingLR in Python 2 #4656
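
For reference, typical usage of the scheduler affected by this fix (the model and hyperparameters are arbitrary placeholders):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = CosineAnnealingLR(optimizer, T_max=100)  # anneal the learning rate over 100 epochs

for epoch in range(100):
    # ... one epoch of training ...
    scheduler.step()
```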

Performance improvements

  • Slightly simplified math in IndexToOffset #4040
  • Improve performance of max pooling backwards #4106
  • Add cuBLAS batched gemm support #4151
  • Rearrange dimensions for pointwise operations for better performance #4174
  • Improve memory access patterns for index operations #4493
  • Improve CUDA softmax performance #4973
  • Fix double memory accesses in several pointwise operations #5068

Documentation and UX Improvements

  • Better error messages for blas ops with cuda.LongTensor #4160
  • Add missing trtrs, orgqr, ormqr docs #3720
  • Change docs for Adaptive Pooling #3746
  • Fix MultiLabelMarginLoss docs #3836
  • More docs for Conv1d and Conv2d #3870
  • Improve Tensor.scatter_ doc #3937
  • [docs] rnn.py: Note zero defaults for hidden state/cell #3951
  • Improve Tensor.new doc #3954
  • Improve docs for torch and torch.Tensor #3969
  • Add explicit tuple dimensions to the docs for Conv1d #4136
  • Improve svd doc #4155
  • Correct InstanceNorm input size #4171
  • Fix StepLR example docs #4478