Bug fixes and performance improvements

@soumith released this on 14 Feb 00:36

Binaries

  • Removed support for CUDA capability 3.0 and 5.0 (they still work in source builds for now, but we no longer commit to supporting them going forward)
  • Stopped binary releases for CUDA 7.5
  • Added CPU-only binary releases that are 10x smaller in size than the full CUDA-enabled binaries.

As always, links to our binaries are on http://pytorch.org

New features

Bug Fixes

Data Loader / Datasets / Multiprocessing

  • Made DataLoader workers more verbose on bus error and segfault. Additionally, added a timeout option to the DataLoader, which raises an error if loading a sample takes longer than the given value. #3474
  • DataLoader workers used to all share the same random number generator (RNG) seed because of the semantics of the fork syscall. Now each worker has its RNG seed set to base_seed + worker_id, where base_seed is a random int64 value generated by the parent process. You can use torch.initial_seed() to access this value in worker_init_fn and use it to set other seeds (e.g. NumPy) before data loading. worker_init_fn is an optional argument that is called in each worker subprocess with the worker id as input, after seeding and before data loading; see the sketch after this list. #4018
  • Add additional signal handling in DataLoader worker processes when workers abruptly die.
  • A negative value for num_workers now raises a ValueError #4019
  • Fixed a typo in the ConcatDataset.cumulative_sizes attribute name #3534
  • Accept longs in default_collate for the DataLoader in Python 2 #4001
  • Re-initialize autograd engine in child processes #4158
  • Fix the distributed DataLoader so it pins memory to the current GPU rather than GPU 0. #4196
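
As a minimal sketch of the new per-worker seeding and the timeout option (the toy dataset and the choice to seed NumPy here are illustrative assumptions, not part of the change itself):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def worker_init_fn(worker_id):
    # Inside a worker, torch.initial_seed() already reflects base_seed + worker_id,
    # so it can be reused to seed other libraries such as NumPy.
    np.random.seed(torch.initial_seed() % 2 ** 32)

dataset = TensorDataset(torch.randn(100, 3), torch.zeros(100).long())
loader = DataLoader(dataset, batch_size=10, num_workers=2,
                    worker_init_fn=worker_init_fn,
                    timeout=30)  # error out if fetching a sample takes longer than 30s

for inputs, targets in loader:
    pass  # training step would go here
```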

CUDA / CuDNN

  • Allow cuDNN for fp16 batch norm #4021
  • Use the enabled argument in torch.autograd.profiler.emit_nvtx (it was previously ignored) #4032
  • Fix cuBLAS arguments for fp16 torch.dot #3660
  • Fix CUDA index_fill_ boundary check with small tensor size #3953
  • Fix CUDA Multinomial checks #4009
  • Fix CUDA version typo in warning #4175
  • Initialize cuda before setting cuda tensor types as default #4788
  • Add missing lazy_init in cuda python module #4907
  • Fix lazy init order in set device; lazy_init should not be called in getDevCount #4918
  • Make torch.cuda.empty_cache() a no-op when cuda is not initialized #4936
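
A tiny illustration of the empty_cache() change above:

```python
import torch

# empty_cache() is now a no-op when CUDA has not been initialized,
# so it is safe to call even if no CUDA tensors have been created
# yet in the process.
torch.cuda.empty_cache()
```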

CPU

  • Assert MKL ld* conditions for ger, gemm, and gemv #4056

torch operators

  • Fix tensor.repeat when the underlying storage is not owned by torch (for example, coming from NumPy) #4084
  • Add proper shape checking to torch.cat (see the example after this list) #4087
  • Add check for slice shape match in index_copy_ and index_add_. #4342
  • Fix use after free when advanced indexing tensors with tensors #4559
  • Fix triu and tril for zero-strided inputs on GPU #4962
  • Fix blas addmm (gemm) condition check #5048
  • Fix topk work size computation #5053
  • Fix reduction functions to respect the stride of the output #4995
  • Improve float precision stability of the linspace op; fixes #4419. #4470
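
A small sketch of the stricter torch.cat shape checking mentioned above:

```python
import torch

a = torch.randn(2, 3)
b = torch.randn(2, 4)

c = torch.cat([a, b], dim=1)   # fine: only the concatenation dimension differs, result is 2 x 7
# torch.cat([a, b], dim=0)     # now raises an error, since the non-cat
                               # dimensions (3 vs 4) do not match
```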

autograd

  • Fix python gc race condition with THPVariable_traverse #4437

nn layers

  • Fix padding_idx getting ignored in backward for Embedding(sparse=True) #3842
  • Fix cosine_similarity's output shape #3811
  • Add rnn args check #3925
  • NLLLoss now works for inputs of arbitrary dimensions (see the example after this list) #4654
  • More strict shape check on Conv operators #4637
  • Fix maxpool3d / avgpool3d crashes #5052
  • Fix the setting that controls use of running stats in InstanceNorm*d #4444
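
A sketch of NLLLoss on higher-dimensional input, as referenced above (written against a recent API; on 0.3.x the tensors would be wrapped in Variable):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Per-pixel classification: (N, C, H, W) log-probabilities with (N, H, W) targets.
log_probs = F.log_softmax(torch.randn(4, 10, 8, 8), dim=1)
target = torch.LongTensor(4, 8, 8).random_(0, 10)

loss = nn.NLLLoss()(log_probs, target)
```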

Multi-GPU

  • Fix DataParallel scattering for empty lists / dicts / tuples #3769
  • Fix refcycles in DataParallel scatter and gather (fixes elevated memory usage) #4988
  • Broadcast output requires_grad only if corresponding input requires_grad #5061

core

  • Remove hard file offset reset in load() #3695
  • Have sizeof account for size of stored elements #3821
  • Fix undefined FileNotFoundError #4384
  • Make torch.set_num_threads also set MKL threads (take 2) #5002
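
For illustration, assuming a CPU build that links against MKL:

```python
import torch

# set_num_threads now also configures the MKL thread pool,
# so a single call bounds CPU intra-op parallelism.
torch.set_num_threads(4)
print(torch.get_num_threads())
```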

others

  • Fix wrong learning rate evaluation in CosineAnnealingLR in Python 2 #4656
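
For reference, typical usage of the scheduler affected by this fix (the model and hyperparameters are arbitrary placeholders):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = CosineAnnealingLR(optimizer, T_max=100)  # anneal the learning rate over 100 epochs

for epoch in range(100):
    # ... one epoch of training ...
    scheduler.step()
```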

Performance improvements

  • Slightly simplified math in IndexToOffset #4040
  • Improve performance of max pooling backwards #4106
  • Add cuBLAS batched gemm support #4151
  • Rearrange dimensions for pointwise operations for better performance #4174
  • Improve memory access patterns for index operations #4493
  • Improve CUDA softmax performance #4973
  • Fix double memory accesses in several pointwise operations #5068

Documentation and UX Improvements

  • Better error messages for blas ops with cuda.LongTensor #4160
  • Add missing trtrs, orgqr, ormqr docs #3720
  • Change docs for Adaptive Pooling #3746
  • Fix MultiLabelMarginLoss docs #3836
  • More docs for Conv1d and Conv2d #3870
  • Improve Tensor.scatter_ doc #3937
  • [docs] rnn.py: Note zero defaults for hidden state/cell #3951
  • Improve Tensor.new doc #3954
  • Improve docs for torch and torch.Tensor #3969
  • Add explicit tuple dimensions to the docs for Conv1d #4136
  • Improve svd doc #4155
  • Correct InstanceNorm input size #4171
  • Fix StepLR example docs #4478