Releases: pytorch/pytorch

alpha-2 Release (Pre-release)

01 Sep 05:03

What's new?

We've

  • built seamless support for multiprocessing with Tensor sharing
  • changed the API of the optim engine
  • added a complete Hook system for nn and autograd
  • added in-place ops to autograd and more neural network modules to nn

Multiprocessing with Tensor sharing

In Torch, or in general, one uses "threads" to build parallel data loaders, as well as to do Hogwild training.
Threads are powerful, as one can share Tensors between threads.
This allows you to:

  • transfer data between threads efficiently, with zero memory copy and no serialization overhead
  • share Tensors among threads for parameter-sharing models

Sharing Tensors among threads is very useful when you do Hogwild training, i.e. when you want to train several models in parallel while sharing their underlying parameters.
This is often used for non-ConvNet workloads, like training word embeddings, RL for games, etc.

In Python, one cannot use threads for this because of a technical limitation: the Global Interpreter Lock (GIL) does not allow threads to execute Python code concurrently.

Hence, the most Pythonic way to use multiple CPU cores is the multiprocessing module.

We made PyTorch integrate seamlessly with Python multiprocessing.
This involved solving some complex technical problems to make it an air-tight solution; more can be read in this in-depth technical discussion.

What this means for you as the end-user is that you can simply use multiprocessing in this way:

# loaders.py
# Functions from this file run in the workers

def fill(queue):
  while True:
    tensor = queue.get()
    tensor.fill_(10)
    queue.put(tensor)

def fill_pool(tensor):
  tensor.fill_(10)

# Example 1: Using multiple persistent processes and a Queue
# process.py

import torch
import torch.multiprocessing as multiprocessing
from loaders import fill

# torch.multiprocessing.Queue automatically moves Tensor data to shared memory
# So the main process and worker share the data
queue = multiprocessing.Queue()
buffers = [torch.Tensor(2, 2) for i in range(4)]
for b in buffers:
  queue.put(b)
processes = [multiprocessing.Process(target=fill, args=(queue,)) for i in range(10)]
for p in processes:
  p.start()

# Example 2: Using a process pool
# pool.py

import torch
from torch.multiprocessing import Pool
from loaders import fill_pool

tensors = [torch.Tensor(2, 2) for i in range(100)]
pool = Pool(10)
pool.map(fill_pool, tensors)
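
The same sharing mechanism is what enables Hogwild-style training, where several workers update the same parameters in place. Below is a minimal sketch of the idea; it assumes the share_memory_() call on Tensors (present in later PyTorch versions; in alpha-2 the Queue shown above achieves the same sharing), and the toy add_ update stands in for a real training step.

# Example 3 (hypothetical): Hogwild-style updates to a shared Tensor
# hogwild.py

import torch
import torch.multiprocessing as multiprocessing

def train(params):
  # each worker updates the shared parameters in place
  for _ in range(1000):
    params.add_(0.001)

if __name__ == '__main__':
  params = torch.zeros(10)
  params.share_memory_()  # assumption: explicitly move the tensor to shared memory

  workers = [multiprocessing.Process(target=train, args=(params,)) for _ in range(4)]
  for w in workers:
    w.start()
  for w in workers:
    w.join()

  print(params)  # reflects in-place updates from all 4 workers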

Optim's API changes

Optimizer's step function now accepts a closure that should return a loss variable (similar to legacy.optim).

We've realized that to keep optim flexible for multiple methods, like SGD with Nesterov momentum, Conjugate Gradient, L-BFGS, etc., we need the input to optim to be a function that evaluates the model.
This is necessary because several optimization methods re-evaluate the function multiple times at different parameter values.
In arriving at this API change, we took into account complicated scenarios like dynamic RNNs and complex ConvNet models with dynamic branching.

So the API now looks like this:

optimizer = optim.SGD(model, lr=1e-3, momentum=0.9)
input, target = ...
optimizer.step(lambda: criterion(model(input), target))  # sufficient for simple models
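
To see why the closure form matters, here is a toy, torch-free illustration of a line-search style step that has to re-evaluate the loss at several candidate parameter values within a single step; all names in it are made up for the sketch and are not part of the optim API.

def line_search_step(params, grad, closure, lr=1.0):
  # backtracking line search: shrink the step until the loss decreases,
  # re-evaluating the closure at each candidate point
  loss0 = closure(params)
  while lr > 1e-6:
    trial = [p - lr * g for p, g in zip(params, grad)]
    if closure(trial) < loss0:
      return trial
    lr *= 0.5
  return params

# minimize f(x) = (x - 3)^2 starting from x = 0
closure = lambda p: (p[0] - 3.0) ** 2
params = [0.0]
for _ in range(20):
  grad = [2.0 * (params[0] - 3.0)]
  params = line_search_step(params, grad, closure)
print(params)  # approaches [3.0]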

To simplify things at the user end for simple or common models, we will introduce a Trainer class that takes a (dataset, model, optim) triple and trains the model. This Trainer class is planned for alpha-3.

A complete Hook system for nn and autograd

Accessing intermediate values during the forward pass is straightforward, but during the backward pass the buffers can rapidly change their contents (for example, when doing in-place optimizations).

If you want access to the gradients at a particular op or layer inside your model, you use the hook system.
Hooks can be attached to variables or to modules and are called as soon as the gradient is available:

# Example in autograd
import torch
from torch.autograd import Variable

a, b, c = [Variable(torch.Tensor(5, 5)) for i in range(3)]

def print_norm(grad):
    print(grad.norm(2))

y = b * c + a
y.register_hook(print_norm)

z = y * y - b
z.backward(torch.ones(5, 5))

# Example in nn
model = ...

def inspect_forward(module, input, output):
    ...

model.conv2.register_forward_hook(inspect_forward)

def inspect_backward(module, grad_input, grad_output):
    ...

model.conv2.register_backward_hook(inspect_backward)
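
For a self-contained illustration, here is a sketch of a forward hook that records an intermediate output. It uses today's torch.nn module names (Sequential, Conv2d, ReLU), which may differ from the exact module set shipped in alpha-2, and the layer choice is arbitrary.

import torch
import torch.nn as nn
from torch.autograd import Variable

activations = {}

def save_activation(module, input, output):
    # store the intermediate output of the hooked module
    activations['conv'] = output

model = nn.Sequential(nn.Conv2d(1, 4, 3), nn.ReLU())
model[0].register_forward_hook(save_activation)

x = Variable(torch.randn(1, 1, 8, 8))
model(x)
print(activations['conv'].size())  # the conv output captured by the hook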

We look forward to comments about the hook system. Let us know what you think.

Added in-place ops to autograd and more neural network modules to nn

  • As part of porting fb.resnet.torch, we've added AveragePool2d and fixed BatchNorm2d
  • Now, autograd fully supports in-place operations, with in-place variables immediately marked as dirty.
    To illustrate this, let's look at a small example:
import torch
from torch.autograd import Variable

x = Variable(torch.ones(5, 5))
y = Variable(torch.ones(5, 5) * 4)

z = x * y
q = z * y
r = z + y
z.add_(y)
# z is the last expression, so this should succeed
z.backward(torch.ones(5, 5))

# r doesn't use z in its backward, so it should succeed
r.backward(torch.ones(5, 5))

# however, q needs z in its backward, but z has now been
# marked as dirty (it was used in an in-place operation),
# so this line will raise an error
q.backward(torch.ones(5, 5))

Plans for alpha-3

  • Unit tests for multiprocessing
  • Add more nn modules and autograd functions (we're porting fb.resnet.torch)
  • New CUDA memory allocator (non-synchronizing CUDA tensor allocations)
    • We've made progress on this, but it is not complete yet
  • Trainer and Dataset classes
  • Continuous builds for CUDA (using Nimbix)
  • Binary packages (nightly and versioned)

alpha-1 release (Pre-release)

01 Sep 05:01

It's been a week since pytorch alpha-0.
We're excited to now present alpha-1 :)

What's new?

We've built a working and unit-tested version of the new nn and autograd packages (torch.nn, torch.autograd) along with a basic draft optim package (torch.optim). The old packages will continue to be available at torch.legacy.*

We've also built fully working serialization (torch.save / torch.load), with features one expects out of the box, like sharing staying intact.

At this point, you can play around with things and get a feel of the new design.

There's an MNIST example at https://github.com/pytorch/examples

A concern raised about pytorch was that Python is a slow language.

It turns out that the MNIST example runs in exactly the same amount of time per epoch in both pytorch and (lua)Torch, and we haven't done any optimizations in the pytorch code yet.

Another notable thing is that pytorch uses 1500MB of system memory vs (lua)Torch's 2300MB. This is before we've added any in-place optimizations to pytorch. The design of the new nn allows us to add memory optimizations without needing the user to mark things as in-place or out-of-place, which will bring further seamless memory savings in pytorch.

More verbosely:

torch.nn

We've published an early version of the new nn package.
There are only a few modules right now, but we'll be adding more soon.

There are a couple of advantages over the old package:

  • Modules no longer hold temporary buffers and short-lived state. This allows you to use the same module multiple times in the forward pass, and the gradients will be summed automatically. For example, see how we use the same nn.ReLU object multiple times here: https://github.com/pytorch/examples/blob/master/mnist/main.py#L43
  • There's no longer any need for rigid container modules. Your model is defined by your code. You can select a completely different path through your model just by adding a number of ifs. Any crazy branching scheme inside your model is allowed by design (see the sketch after this list).
  • It's fully compatible with autograd. Instead of using nn.Add or nn.Index you can just write this in your model definition: y = module1(x_1)[0] + module2(x_2).
  • You can register both forward and backward hooks at each module, which allow you to inspect the intermediate outputs and gradients flowing through the network and the graph.
  • [Not Yet Implemented] Safe in-place operations. Tensors used in in-place operations are marked as dirty, and trying to use them in any way raises an error.
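
Here is the sketch referenced in the list above: one ReLU object reused twice, and an ordinary Python if selecting a branch. It uses today's module names (Linear, ReLU), which may differ slightly from the alpha-1 module set.

import torch
import torch.nn as nn
from torch.autograd import Variable

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(10, 10)
        self.fc2 = nn.Linear(10, 10)
        self.relu = nn.ReLU()  # one module object, used twice in forward

    def forward(self, x, use_second_branch):
        x = self.relu(self.fc1(x))
        if use_second_branch:  # plain Python control flow defines the graph
            x = self.relu(self.fc2(x))
        return x

net = Net()
out = net(Variable(torch.randn(2, 10)), use_second_branch=True)
out.sum().backward()  # gradients flow through whichever path was taken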

torch.autograd

Autograd is at the core of pytorch. Enabling it is just a matter of wrapping your tensors in Variable objects before starting the computation (x = Variable(x)). Then, when you have your output, you can either call y.backward() if it's a scalar, or provide a gradient w.r.t. the variable as an argument (y.backward(grad_output)). Gradients w.r.t. variables are then available in their .grad attributes. Please note that only gradients of leaf variables (i.e. those created by the user) are computed. If you want to access gradients of intermediate values, you'll have to use the hook system.

If you don't want to compute gradient for some variables, you can even mark them in a constructor with requires_grad=False, and they will be optimized out from the backward pass.
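
A minimal sketch of the flow just described (the values are arbitrary): wrap tensors in Variable, run the computation, call backward() on a scalar output, and read gradients from the .grad attributes; the variable constructed with requires_grad=False is left out of the backward pass.

import torch
from torch.autograd import Variable

x = Variable(torch.ones(5), requires_grad=True)      # leaf variable, gradient wanted
w = Variable(torch.ones(5) * 2, requires_grad=True)  # leaf variable, gradient wanted
t = Variable(torch.ones(5), requires_grad=False)     # no gradient computed for this one

y = (w * x + t).sum()  # scalar output, so backward() needs no arguments
y.backward()

print(x.grad)  # gradient of y w.r.t. x
print(w.grad)  # gradient of y w.r.t. w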

torch.optim

Please note that this api is still a bit experimental, and is likely to undergo changes soon.

optim has a different, more object-oriented API. First, you have to create an optimizer object: optimizer = optim.sgd(model, lr=1e-3, momentum=0.9). If you don't want to merge the model and criterion into a single object, it's also possible to pass a tuple of (model, criterion) as the first argument to the constructor. Then, in your training loop you just call loss = optimizer.step(input) (in the case of a separate model and criterion, input should be a tuple of (input, target)). This accumulates all the gradients and performs a single optimization step on the parameters.

Serialization

Tensors have supported the pickle protocol since the beginning of alpha, but pickle can't handle storage/data sharing properly and requires all the data to be copied before serialization.
We've created torch.load and torch.save, that have the same interface and solve both of these problems.
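
A small sketch of the sharing guarantee (the file name is arbitrary): two tensors that share storage are saved together, and after loading them back, a write through one is still visible through the other.

import torch

a = torch.ones(4)
b = a[:2]  # b is a view sharing storage with a

torch.save((a, b), 'tensors.pt')
a2, b2 = torch.load('tensors.pt')

b2.fill_(7)
print(a2)  # the first two elements are 7: sharing survived the round trip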

Tensor operators

Thanks to @bart, we've added support for the @ operator for matrix multiplication and changed * to mean elementwise multiplication.
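
A quick illustration of the two operators:

import torch

a = torch.ones(2, 3)
b = torch.ones(3, 2)

print(a @ b)  # matrix multiplication, result has shape (2, 2)
print(a * a)  # elementwise multiplication, result has shape (2, 3)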

Plans for alpha-2:

  • Hook system for nn and autograd (for accessing intermediate values)
  • More nn modules, autograd options, and optim algorithms
  • Inter-process sharing of tensors (for multiprocess data loading or hogwild training)
  • New CUDA memory allocator (non-synchronizing CUDA tensor allocations)