this is me being a bit paranoid, but in test_gpt2.cu we check that our code agrees with the pytorch reference. we're currently using a single global threshold of 1e-2 for all comparisons. instead, we could compare the parameter gradients parameter by parameter, and tune the threshold per parameter, making it as low as we can, maybe eyeballing a ~10% buffer on top. otherwise my concern is that one global 1e-2 could be too loose for some of these parameter gradients in absolute terms, and we could be making silent errors with new kernels. when a new kernel "trips the wire", we should inspect manually and carefully that things are ok despite tripping the check; if they are, it's okay to increase the bound.
the code for checking all parameters is already there, but commented out.
would welcome a PR that digs into this on a per-parameter basis and looks at what thresholds we can get away with in this comparison.
adding a todo