When running on Apple GPU (MPS), the loss is always nan. #199
Comments
I ran the test using the same code on another CUDA machine, and the results were completely normal. |
Well, that's weird. |
I have the same problem. Did you solve it? Bro |
No, I don't have any ideas on how to handle this issue. Are you experiencing the exact same problem? |
Yes. When I run the last part of hellokan, the train loss and test loss are nan. The symbolic formula gives the same result. |
I have the same problem; I mentioned it in #179 |
Are you using mps? Or cuda? |
I'm using plain Apple silicon, so I can't use CUDA. |
I'm in the same situation as you. |
Can you guys try pruning your model with |
Bro, I tried. The problem seems to have been solved. Thank you |
The problem is caused by the loss function, as the author mentioned, so a lower threshold gives better results but a worse prune. |
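For reference, a pruning call along those lines might look roughly like the sketch below. This is only an illustration, not the exact call from the earlier (truncated) comment; the keyword name `threshold` and its value are assumptions about the pykan prune API in use at the time.

```python
# Hedged sketch: prune with a larger threshold so weak activations are removed
# before they can poison the loss. Assumes the model has already been trained
# (or at least run forward) so activation scales are available.
model = model.prune(threshold=1e-1)  # larger than the usual default
model.plot()
```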
There is something wrong with torch.linalg.lstsq in spline.py; it can lead to nan when we are not using the CPU. I have added an alternative method to calculate the coef in my branch with the following setting.
Hope it works on Apple GPU. |
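The exact change in that branch is not shown here. As a general illustration, a least-squares solve can be expressed through an SVD-based pseudoinverse, sidestepping torch.linalg.lstsq entirely. This is a rough sketch, not the author's patch; the function name and the rcond cutoff are made up for the example.

```python
import torch

def lstsq_via_svd(A, B, rcond=1e-8):
    # Hedged sketch: least-squares solve through an SVD-based pseudoinverse,
    # as a drop-in alternative to torch.linalg.lstsq on backends where lstsq
    # misbehaves. The function name and rcond cutoff are illustrative only.
    U, S, Vh = torch.linalg.svd(A, full_matrices=False)
    S_inv = torch.where(S > rcond * S.max(), 1.0 / S, torch.zeros_like(S))
    return Vh.transpose(-2, -1) @ (S_inv.unsqueeze(-1) * (U.transpose(-2, -1) @ B))
```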
I tried running this on a couple of the examples in my CUDA setup, but I got the following error:
|
You may want to pass a device parameter when calling train; the default device for train is 'cpu'. |
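For example, a hellokan-style setup with the device passed explicitly might look like the sketch below. The keyword names follow my recollection of the pykan API at the time and may differ in your version.

```python
from kan import KAN, create_dataset
import torch

# Hedged sketch: pass the device explicitly so tensors end up on MPS rather than
# the default 'cpu'. Keyword names are assumptions about the pykan API in use.
device = 'mps' if torch.backends.mps.is_available() else 'cpu'

f = lambda x: torch.exp(torch.sin(torch.pi * x[:, [0]]) + x[:, [1]] ** 2)
dataset = create_dataset(f, n_var=2, device=device)

model = KAN(width=[2, 5, 1], grid=5, k=3, device=device)
model.train(dataset, opt="LBFGS", steps=20, device=device)
```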
I tried, but there are still problems. Can you share the code? Thanks. |
I tried your code, but the following error occurred: |
Maybe updating your torch version could help; give that a try. |
I don't have specific code, but I do believe this is a problem with the loss function outputs, so pruning with bigger thresholds could help; try more specific values. But it's so weird that it only happens on MPS devices. It scratches my brain, tbh. |
It's weird, though. I used torch.linalg.svd instead of torch.linalg.lstsq in my code. Not sure why this happens in your case. I have made some updates in my branch, avoiding any nan, inf, and -inf in the coef results, which, at least in my case, works and avoids nan in the loss. This time, just use the default settings for training. |
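That branch's exact guard is not reproduced here, but one way to keep non-finite values out of the coefficients is something along these lines; torch.nan_to_num is a standard PyTorch call, while the replacement values are purely illustrative.

```python
import torch

def sanitize_coef(coef):
    # Hedged sketch: replace nan / inf / -inf in the fitted coefficients with
    # finite values so they cannot propagate into the loss. The clamp values
    # are illustrative, not taken from the author's branch.
    return torch.nan_to_num(coef, nan=0.0, posinf=1e4, neginf=-1e4)
```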
I updated to PyTorch 2.3.0, but unfortunately it didn't improve. As for pruning, in fact none of the units in the model produced qualified results, so pruning had no effect either. |
I set PYTORCH_ENABLE_MPS_FALLBACK=1 in the appropriate place, and it took effect. But unfortunately, it didn't help much. |
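For anyone else trying this: the flag generally has to be in the environment before torch is imported (or exported in the shell beforehand), otherwise it has no effect. A minimal sketch:

```python
import os

# Hedged sketch: set the MPS fallback flag before importing torch; setting it
# after the import is typically too late for it to take effect.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch  # imported only after the environment variable is in place
```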
KAN.grid or KANLayer.coef may be the two main reasons why the loss is nan during training. I found that KAN.grid can become nan during backpropagation even though it is not trainable. Deleting the initialization self.grid = torch.nn.Parameter(...) might help. As for KANLayer.coef, sometimes the initialization itself is nan, which makes the subsequent training fail. One potential approach is to replace the initialization self.coef = torch.nn.Parameter(curve2coef(...)) with torch.ones of the same size and monitor the learning of coef to check whether it ever becomes nan. |
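A rough sketch of that kind of monitoring, assuming the coefficient tensors are registered parameters whose names contain "coef" (a naming assumption, not something guaranteed by the library):

```python
import torch

def check_coef_health(model):
    # Hedged sketch: walk the model's parameters and report any coef tensor
    # containing nan or inf, so a bad initialization or update is caught early.
    for name, p in model.named_parameters():
        if "coef" in name and not torch.isfinite(p).all():
            print(f"non-finite values in {name}")
```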
As for the error, I found that it consistently occurs during what should be a perfectly ordinary matrix multiplication in one of the loops of the SVD calculation. A quick search turned up many similar issues, such as pytorch/pytorch#113586 and pytorch/pytorch#96153. It seems to be purely an implementation issue with MPS. If a deep copy is used for the variables computed around that point, bypassing the MPS error, the failure moves into the SVD estimator: torch._C._LinAlgError: linalg.svd: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 7). The reason is that the parameters passed to the function are all nan. There is no such problem when using the CPU. Initializing with a matrix full of ones did not help either. My interim conclusion is that PyTorch's MPS implementation is quite unstable, and almost any problem can surface on it. |
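One way to narrow down whether the backend itself is producing non-finite values, assuming an Apple-silicon machine with MPS available, is a small CPU-vs-MPS comparison like the sketch below. Note that in some torch builds linalg.lstsq is simply not implemented for MPS and will raise instead, which is informative in its own way.

```python
import torch

# Hedged sketch: run the same least-squares solve on CPU and on MPS and compare.
# If the MPS result contains nan while the CPU result is finite, the problem is
# in the backend rather than in the KAN code.
torch.manual_seed(0)
A = torch.randn(64, 10)
B = torch.randn(64, 3)

cpu_sol = torch.linalg.lstsq(A, B).solution
mps_sol = torch.linalg.lstsq(A.to("mps"), B.to("mps")).solution.cpu()

print("cpu finite:", torch.isfinite(cpu_sol).all().item())
print("mps finite:", torch.isfinite(mps_sol).all().item())
print("max abs diff:", (cpu_sol - mps_sol).abs().max().item())
```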
It works for me on my Mac setup. |
I am using an M1 Pro Mac. I have installed the latest official PyTorch build with MPS support. Previously, when running other models on the GPU, I did not encounter similar issues.
When running the example provided at the beginning of the official documentation, the result is normal on CPU; on MPS, the result is always nan.
The above is my code. When running it, exceptions may occur during training, and there is also a chance of an exception when plotting the model structure. The error message is:
ValueError: alpha (nan) is outside 0-1 range
Even when there is no error, the plotted graph differs noticeably from the CPU result (the lines in the image are visibly thinner, and the function graphs inside the nodes fluctuate abnormally). I found a similar issue under issues:
I don't know why, but if I use MPS (Apple Silicon), the loss is nan.
Originally posted by @brainer3220 in #98 (comment)