Alpha-4 Release
Some interesting stats
On ResNets
Because we aggressively free and reallocate resources, ResNets in PyTorch take less memory than in torch-nn:
- 4.4GB in PyTorch
- 6.5GB in Torch-nn
- 4.6GB in Torch-nn with a hacky sharing of gradInput buffers
- On 1 GPU, PyTorch is tens of milliseconds faster than Torch-nn
- On 2 GPUs, PyTorch matches Torch-nn's speed
- On 4 GPUs, PyTorch is about 10 to 20% slower, but only because we have just finished implementing multi-GPU support; we will be closing this performance gap over the next week.
FFI-based C extension
On a small benchmark of adding a constant to a 5x5 tensor over 1000 calls:
- LuaJIT FFI: 0.001 seconds
- Lua 5.2 FFI: 0.003 seconds
- PyTorch CFFI: 0.003 seconds
- Raw Python CFFI / CTypes: 0.001 seconds
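As a rough illustration, the shape of such a micro-benchmark looks like the sketch below (a hedged re-creation using `time.perf_counter`, not the harness that produced the numbers above; absolute timings will differ by machine):

```python
import time

import torch

# Time 1000 calls that add a constant to a 5x5 tensor. For an op this tiny,
# per-call binding overhead dominates the actual compute.
x = torch.ones(5, 5)
start = time.perf_counter()
for _ in range(1000):
    y = x + 3.5
elapsed = time.perf_counter() - start
print(f"1000 calls took {elapsed:.4f}s")
```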
What's new in Alpha-4?
Usability
- Two Tutorials, now located at: https://github.com/pytorch/tutorials
- Examples:
- A full Imagenet / ResNet example is now located at: https://github.com/pytorch/examples/tree/master/imagenet
- it works! :)
- Has performant Multi-GPU support
- Improved error messages and shape checks across the board in pytorch, TH, and THNN
- `torch.*` functions no longer use CamelCase, but use underscore_case. Example: `torch.index_add_`
New Features and modules
- Multi-GPU primitives
- A custom CUDA allocator to maximize autograd performance (backported to Torch too)
- More autograd functions. Now it's almost API complete for all differentiable `torch.*` functions.
- CuDNN Integration
- Multiprocess DataLoader in `torch.utils` (used in the ImageNet example)
- Extensions API to interface with your C code simply via FFI
Plans for Alpha-5
- Revamping and rethinking the Checkpointing API
- Revamping the Optim API to support things like per-layer learning rates and optimizing non-weights (like in NeuralStyle)
- RNN Examples, initially for PennTreeBank language modeling
- Better RNN support in general, improved error messages, multi-GPU etc.
- NCCL integration for improved multi-GPU performance (already implemented in #78)
- Documentation / reference manual for `torch.*` and `autograd`
Usability
Tutorials
We've added two tutorials to get you all started.
- Tutorial 1: Introduction to PyTorch for former Torchies
- In this tutorial we cover the torch, autograd, and nn packages from the perspective of a former Torch user.
- Going through this tutorial should get you started. Let us know how we can improve it.
- Tutorial 2: Write your own C code that interfaces into PyTorch via FFI
- In this tutorial, we showcase how you can call your own C code that takes torch tensors as inputs / outputs in a seamless way via FFI
- The tutorial showcases how you can write your own neural network Module that calls in C implementations
Examples
We've added a full ImageNet example with ResNets that should be well suited to “learning by example”.
It is located here: https://github.com/pytorch/examples/tree/master/imagenet
For now, the data for the example has to be preprocessed in the same way as specified in fb.resnet.torch
The example has Multi-GPU support in a DataParallel fashion.
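The usage pattern can be sketched as follows (illustrative model and shapes; `nn.DataParallel` splits the batch across visible GPUs and simply runs the wrapped module on a single device when no extra GPUs are available):

```python
import torch
import torch.nn as nn

model = nn.DataParallel(nn.Linear(10, 5))  # replicates the module across GPUs
x = torch.randn(4, 10)                     # the batch is split along dim 0
out = model(x)                             # outputs are gathered onto one device
```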
Improved error messages
We've gone through the TH and THNN C libraries and added much more intuitive error messages that report the mismatched shapes. We will continue to make improvements on this front.
If you have any unintuitive error messages that you encounter, please open an issue at https://github.com/pytorch/pytorch/issues
For example:
Old error message:
bad argument #2 to 'v' (3D or 4D (batch mode) tensor expected for input
New error message:
bad argument #2 to 'v' (3D or 4D (batch mode) tensor expected for input, but got: [100 x 100]
No more CamelCase for functions
All torch functions have been renamed from CamelCase to underscore_case.
- indexAdd → index_add_
- getRNGState → get_rng_state
- etc.
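For example (using the current spelling of the API; the trailing underscore marks an in-place operation):

```python
import torch

# index_add_ (formerly indexAdd) adds `src` rows into `t` at the given indices,
# in place — signaled by the trailing underscore.
t = torch.zeros(5)
index = torch.tensor([0, 2])
src = torch.ones(2)
t.index_add_(0, index, src)  # t is now [1, 0, 1, 0, 0]
```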
New Features and modules
Multi-GPU primitives
- We've added efficient multi-GPU support in general for neural networks. Instead of building magic blocks that do opaque parallelization for you, we've broken them down into easy to use collectives.
- A pattern like DataParallel is implemented in terms of:
- replicate, scatter, gather, parallel_apply
- These are reusable collectives for implementing other multi-gpu patterns as well
- https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/__init__.py#L24-L38
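Roughly, a DataParallel-style forward pass can be sketched in terms of these collectives (a simplified sketch of the pattern, not the exact library code; the real version also handles kwargs and device placement):

```python
from torch.nn.parallel import replicate, scatter, parallel_apply, gather


def data_parallel_sketch(module, input, device_ids, output_device):
    replicas = replicate(module, device_ids)    # copy the module onto each GPU
    inputs = scatter(input, device_ids)         # split the batch across GPUs
    outputs = parallel_apply(replicas, inputs)  # run replicas in Python threads
    return gather(outputs, output_device)       # concatenate on the output GPU
```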
Performance
With Multi-GPU, we naturally overlap data transfers with compute across the whole graph. This makes multi-GPU much more efficient, and is done in a way that does not interfere with the imperativeness / error reporting.
Another important note is that we now dispatch parallel modules via Python threads, which launches CUDA kernels in a breadth-first fashion, getting rid of obvious kernel-launch latency bottlenecks.
Custom CUDA allocator to maximize autograd performance
In Torch, we had to write nn modules carefully to avoid CUDA synchronization points, which were a multi-GPU bottleneck and a general performance bottleneck. This sometimes cost neural networks and autograd up to a 2x performance penalty.
In PyTorch (and Torch), Sam Gross has written a new caching CUDA allocator that avoids CUDA synchronization points, while being well suited to tensor use cases where we typically make short-term and long-term allocations of the same tensor sizes.
This unblocks us from a lot of performance issues.
More autograd functions
Now the torch.* API should pretty much be ready for full autograd support (short of 3 functions).
Autograd has been enabled for all the functions with the exception of non-differentiable functions like torch.eq.
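For example, gradients flow through any differentiable `torch.*` function (shown here with the current tensor-based spelling; the alpha-era API wrapped tensors in `Variable`):

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
y = (x * x).sum()  # differentiable torch.* ops record the graph
y.backward()
# dy/dx = 2x, so every entry of x.grad is 2
```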
CuDNN Integration
We now fully integrate and support CuDNN version 5.1.3, and it is shipped in the binaries (just like CUDA), so you never have to worry about manually downloading and installing it from the NVIDIA website.
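You can sanity-check what your binary shipped with (a small sketch; `version()` returns `None` on builds without cuDNN, and an integer version otherwise):

```python
import torch

print(torch.backends.cudnn.is_available())  # True when cuDNN is usable here
print(torch.backends.cudnn.version())       # an integer version, or None
```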
Generic Multiprocess DataLoader
We've added a flexible Data Loader that supports multiple data loading workers. This enables a lot of use-cases, and is first used in our Imagenet example.
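A minimal sketch of the loader (illustrative toy dataset; setting `num_workers > 0` is what enables loading batches in parallel worker processes):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

data = torch.arange(10).float().unsqueeze(1)  # 10 samples of size 1
labels = torch.arange(10)
loader = DataLoader(TensorDataset(data, labels),
                    batch_size=4,
                    num_workers=0)  # raise num_workers to use worker processes
batches = [x for x, y in loader]    # 3 batches: sizes 4, 4, 2
```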
C Extensions API
We added an easy to use extensions API and an example extension here:
https://github.com/pytorch/extension-ffi
You can call your C functions (that have TH*Tensor inputs / outputs and other fundamental types in the function signature) without writing any manual Python bindings.
One question you might have is what kind of call overhead these auto-generated FFI bindings add. The answer is “none”, as the numbers at the beginning of these notes show.
The example extension also covers how you can define your autograd-ready nn module that calls your C function.
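The flavor of calling C without hand-written bindings can be illustrated with plain ctypes (this is only an analogy on a Unix-like system, not the extension API from pytorch/extension-ffi, which generates bindings for functions taking TH*Tensor arguments for you):

```python
import ctypes
import ctypes.util

# Load the system math library and call a C function directly, with no
# manually written Python binding code (a toy stand-in for a user C function).
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]
result = libm.sqrt(9.0)
```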