How to use sparse.mm in float16 training pipeline #5282

Closed
fuy34 opened this issue Dec 27, 2020 · 8 comments
Labels
question (Further information is requested)

Comments

fuy34 commented Dec 27, 2020

What is your question?

How can we force a certain operation (e.g. torch.sparse.mm) to run as a float32 operation in a float16 training setting?

Details and what I have tried

I am trying to train a model using

pl.Trainer(distributed_backend='ddp', precision=16, amp_level='01', gpus=2)

and I need to use sparse tensor multiplication in the forward pass. I got RuntimeError: "addmm_sparse_cuda" not implemented for 'Half', as reported in PyTorch issue #41069. However, the error remains even after I changed the variable type to float32.

I guess Apex or PyTorch Lightning is still calling torch.sparse.mm under the float16 setting. Is it possible to mark a certain operation in the float16 training pipeline as a float32 operation? Or is there an alternative way to use torch.sparse.mm within a float16 training process?

Reproduce

Initialize any model (e.g. the official MNIST demo), set

trainer = pl.Trainer(distributed_backend='ddp', precision=16, amp_level='01')

and add the following code to the forward function:

# dense matrix of shape (3, 2)
a = torch.randn(3, 2).float().cuda()
# sparse COO matrix of shape (2, 3) with three non-zero entries
i = torch.LongTensor([[0, 1, 1], [2, 0, 2]])
v = torch.FloatTensor([3, 4, 5])
b = torch.sparse.FloatTensor(i, v, torch.Size([2, 3])).float().cuda()
# sparse-dense matmul -- under precision=16 this raises the "addmm_sparse_cuda" not implemented for 'Half' error
c = torch.sparse.mm(b, a)

I cannot afford to do c = b.to_dense() @ a in practice because of limited GPU memory.

What's your environment?

  • OS: Ubuntu 16.04
  • Packaging: conda
  • PyTorch: v1.6.0
  • PyTorch Lightning: v0.9.0
  • CUDA: 10.2
fuy34 added the question label Dec 27, 2020
github-actions (Contributor)

Hi! Thanks for your contribution, great first issue!

fuy34 (Author) commented Dec 28, 2020

--------- Update ---------
I am not sure if this is the right way to do it, but it seems to work for me: I wrap the operations that need float32 in

with torch.cuda.amp.autocast(enabled=False):
    ...<operations>...
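
For reference, a minimal sketch of that workaround as a helper (the function name and the assumption of a 2-D dense input are illustrative, not from my actual model):

import torch

def sparse_mm_fp32(sparse_w, feature):
    # hypothetical helper, not part of the actual model code
    # Locally disable autocast so the sparse matmul does not run in float16,
    # and cast the dense input up to float32 because it may arrive as
    # float16 from the surrounding autocast region.
    with torch.cuda.amp.autocast(enabled=False):
        return torch.sparse.mm(sparse_w, feature.float())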

awaelchli (Member)

Isn't it "O1" for amp level?

For torch native amp, see this list of ops that can autocast to float16:
https://pytorch.org/docs/stable/amp.html#ops-that-can-autocast-to-float16

fuy34 (Author) commented Dec 28, 2020

Yes, "01" is for amp level.

I have a CNN model to train, and there is one operation in the forward pass that uses a sparse tensor. More specifically, the model has a self.sparse_tensor variable; for the feature produced by one CNN module, I do new_feat = self.sparse_tensor @ feature and pass new_feat to the next CNN module.

The pl.Trainer setting I mentioned just shows how I train the model. I am not sure whether the error is related to how PyTorch Lightning calls torch.sparse.mm; that is why I presented it above.
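
For concreteness, a hypothetical sketch of how the autocast workaround could look inside such a forward pass (self.cnn1, self.cnn2, and the 2-D feature shape are assumptions for illustration, not the actual model):

def forward(self, x):
    # illustrative module names; the real model differs
    feature = self.cnn1(x)  # runs in float16 under amp
    with torch.cuda.amp.autocast(enabled=False):
        # run the unsupported sparse matmul in float32
        # (self.sparse_tensor is assumed to be a float32 sparse CUDA tensor)
        new_feat = torch.sparse.mm(self.sparse_tensor, feature.float())
    return self.cnn2(new_feat)  # amp casts the input back down as needed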

awaelchli (Member) commented Dec 29, 2020

> Yes, "01" is for amp level.

No, I'm saying it should be "O1" not "01".

PL doesn't convert ops and tensors directly; it relies on either Apex or native torch amp. As you can see in the link I posted, sparse matrix multiplication is not among the ops supported by torch native amp.
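
To illustrate that point, a standalone snippet (not from this thread): inside an autocast region, ops on the float16 cast list run in half precision even when their inputs were created as float32, which is why pre-casting the tensors alone does not avoid the error.

import torch

x = torch.randn(4, 4, device="cuda", dtype=torch.float32)
y = torch.randn(4, 4, device="cuda", dtype=torch.float32)
with torch.cuda.amp.autocast():
    z = torch.mm(x, y)  # torch.mm is on the float16 cast list
print(z.dtype)  # torch.float16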

awaelchli (Member)

When pytorch/pytorch#41069 gets implemented, Lightning will automatically support it.

fuy34 (Author) commented Dec 29, 2020

Oh, I see. Please excuse me; I do not know why I kept typing "01". It is "O1" for sure.
I think I will temporarily use the "dirty" workaround I mentioned above while waiting for the new feature from torch.
Thank you!

fuy34 closed this as completed Dec 29, 2020
awaelchli (Member)

Okay, let me know if you run into more questions.
