- apparently the model is relatively fine with replacing input tokens with
<pad>
even withoutshort_ctx_dropout_p > 0
- refuses to converge on Google's TPUs for some reason
- not even with
torch_xla._XLAC._xla_set_use_full_mat_mul_precision(True)
- not even with