How to use sparse.mm in float16 training pipeline #5282

Closed
fuy34 opened this issue Dec 27, 2020 · 8 comments
Labels
question (Further information is requested)

Comments

fuy34 commented Dec 27, 2020

What is your question?

How can we force a certain operation (e.g. torch.sparse.mm) to run as a float32 operation in a float16 training setting?

Details and what I have tried

I am trying to train a model using

pl.Trainer(distributed_backend='ddp', precision=16, amp_level='01', gpus=2)

and I need to use sparse tensor multiplication in the forward pass. I got RuntimeError: "addmm_sparse_cuda" not implemented for 'Half', as reported in PyTorch issue #41069. However, the error remains even after I changed the variable type to float32.

I guess Apex or PyTorch Lightning is still calling torch.sparse.mm under the float16 setting. Is it possible to mark a certain operation in the float16 training pipeline as a float32 operation? Or is there an alternative way to use torch.sparse.mm within a float16 training process?

Reproduce

Initialize any model (e.g. the official MNIST demo), set

trainer = pl.Trainer(distributed_backend='ddp', precision=16, amp_level='01')

and add the following code to the forward function:

# dense matrix of shape (3, 2)
a = torch.randn(3, 2).float().cuda()
# sparse COO matrix of shape (2, 3) with three non-zero entries
i = torch.LongTensor([[0, 1, 1], [2, 0, 2]])
v = torch.FloatTensor([3, 4, 5])
b = torch.sparse.FloatTensor(i, v, torch.Size([2, 3])).float().cuda()
# sparse-dense matmul -- under precision=16 this raises the "addmm_sparse_cuda" not implemented for 'Half' error
c = torch.sparse.mm(b, a)

I cannot afford to do c = b.to_dense() @ a in practice because of limited GPU memory.

What's your environment?

  • OS: Ubuntu 16.04
  • Packaging: conda
  • PyTorch: v1.6.0
  • PyTorch Lightning: v0.9.0
  • CUDA: 10.2
fuy34 added the question label Dec 27, 2020
github-actions (Contributor)

Hi! Thanks for your contribution, great first issue!

fuy34 (Author) commented Dec 28, 2020

--------- Update ---------
I am not sure if this is the right way to do it, but it seems to work for me: I wrap the operations that need float32 in

with torch.cuda.amp.autocast(enabled=False):
    ...<operations>...
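
For reference, a minimal sketch of that workaround as a helper (the function name and the assumption of a 2-D dense input are illustrative, not from my actual model):

import torch

def sparse_mm_fp32(sparse_w, feature):
    # hypothetical helper, not part of the actual model code
    # Locally disable autocast so the sparse matmul does not run in float16,
    # and cast the dense input up to float32 because it may arrive as
    # float16 from the surrounding autocast region.
    with torch.cuda.amp.autocast(enabled=False):
        return torch.sparse.mm(sparse_w, feature.float())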

awaelchli (Member)

Isn't it "O1" for amp level?

For torch native amp, see this list of ops that can autocast to float16:
https://pytorch.org/docs/stable/amp.html#ops-that-can-autocast-to-float16

fuy34 (Author) commented Dec 28, 2020

Yes, "01" is for amp level.

I have a CNN model to train, and there is one operation in the forward pass that uses a sparse tensor. More specifically, the model has a self.sparse_tensor variable; for the feature produced by one CNN module, I do new_feat = self.sparse_tensor @ feature and pass new_feat to the next CNN module.

The pl.Trainer setting I mentioned just shows how I train the model. I am not sure whether the error is related to how PyTorch Lightning calls torch.sparse.mm; that is why I presented it above.
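
For concreteness, a hypothetical sketch of how the autocast workaround could look inside such a forward pass (self.cnn1, self.cnn2, and the 2-D feature shape are assumptions for illustration, not the actual model):

def forward(self, x):
    # illustrative module names; the real model differs
    feature = self.cnn1(x)  # runs in float16 under amp
    with torch.cuda.amp.autocast(enabled=False):
        # run the unsupported sparse matmul in float32
        # (self.sparse_tensor is assumed to be a float32 sparse CUDA tensor)
        new_feat = torch.sparse.mm(self.sparse_tensor, feature.float())
    return self.cnn2(new_feat)  # amp casts the input back down as needed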

awaelchli (Member) commented Dec 29, 2020

> Yes, "01" is for amp level.

No, I'm saying it should be "O1" not "01".

PL doesn't convert ops and tensors directly; it relies on either Apex or native torch amp. As you can see in the link I posted, sparse matrix multiplication is not among the ops supported by torch native amp.
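
To illustrate that point, a standalone snippet (not from this thread): inside an autocast region, ops on the float16 cast list run in half precision even when their inputs were created as float32, which is why pre-casting the tensors alone does not avoid the error.

import torch

x = torch.randn(4, 4, device="cuda", dtype=torch.float32)
y = torch.randn(4, 4, device="cuda", dtype=torch.float32)
with torch.cuda.amp.autocast():
    z = torch.mm(x, y)  # torch.mm is on the float16 cast list
print(z.dtype)  # torch.float16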

awaelchli (Member)

When pytorch/pytorch#41069 gets implemented, Lightning will automatically support it.

fuy34 (Author) commented Dec 29, 2020

Oh, I see. Please excuse me; I do not know why I kept typing "01". It is "O1" for sure.
I think I will temporarily use the "dirty" workaround I mentioned above while waiting for the new feature from torch.
Thank you!

fuy34 closed this as completed Dec 29, 2020
awaelchli (Member)

Okay, let me know if you run into more questions.
