[Bug Report] sigmoid_cross_entropy_with_logits: autodiff over the decomposed primitive ops disagrees with the backward kernel #64226

zeroRains · 2024-05-11T13:18:16Z

Describe the Bug

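While implementing the decomposition of the sigmoid_cross_entropy_with_logits op with Paddle APIs, I found that the forward pass produces identical results but the backward pass shows a precision mismatch. I suspect that autodiff over the decomposed primitive ops and the op's backward kernel do not compute the same thing.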

Reproduction code:

import numpy as np
import paddle

np.random.seed(2023)
paddle.seed(2023)

batch_size = 20
num_classes = 10
shape = (batch_size, num_classes)

# Random logits, labels, and per-element positive weights in [0, 1).
x = np.random.uniform(0, 1, shape).astype("float32")
label = np.random.uniform(0, 1, shape).astype("float32")
pos_weight = np.random.uniform(0, 1, shape).astype("float32")

# With an all-ones pos_weight the two gradients agree:
# pos_weight = np.ones(shape).astype("float32")


def fn_ref(x, label, weight):
    # Reference: the fused op, whose backward pass calls the grad kernel.
    out = paddle._C_ops.sigmoid_cross_entropy_with_logits(
        x, label, weight, False, -100)
    loss = out.sum()
    loss.backward()
    return out, x.grad


def fn_comp(x, label, weight):
    # Decomposition into primitive ops; the gradient comes from autodiff.
    zeros = paddle.full(shape, 0.)
    t1 = paddle.where(x > 0, x, zeros)
    t2 = x * label
    t3 = paddle.log(1 + paddle.exp(-paddle.abs(x)))
    t4 = t1 - t2 + t3 * weight
    # Zero the loss where label equals ignore_index (-100).
    t5 = paddle.full(shape, -100.)
    out = paddle.where(label == t5, zeros, t4)
    loss = out.sum()
    loss.backward()
    return out, x.grad


def cal(fn):
    # Build fresh tensors per call so gradients do not accumulate.
    x1 = paddle.to_tensor(x, stop_gradient=False)
    label1 = paddle.to_tensor(label)
    pos_weight1 = paddle.to_tensor(pos_weight)
    return fn(x1, label1, pos_weight1)


ref = cal(fn_ref)
actual = cal(fn_comp)

# Index 0 compares the forward output, index 1 the gradient w.r.t. x.
for idx in range(len(ref)):
    np.testing.assert_allclose(
        ref[idx].numpy(), actual[idx].numpy(),
        atol=1e-6, rtol=1e-6, err_msg=f"****{idx} index error******")

Screenshot of the bug: (image not available)

Additional Supplementary Information

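At this point it is fairly clear that the bug is triggered by pos_weight. When the optional pos_weight argument is absent, an all-ones Tensor is used in its place, and autodiff and the backward kernel agree. When pos_weight is not all ones, the results diverge.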

Detailed analysis:
The part of the kernel's forward computation that involves pos_weight:

# paddle/phi/kernels/cpu/sigmoid_cross_entropy_with_logits_grad_kernel.cc L48-L52

      T pos_weight_idx = pos_weight_data == nullptr ? 1 : pos_weight_data[idx];
      T term1 = (x > 0) ? x : 0;
      T term2 = x * label;
      T term3 = std::log(static_cast<T>(1) + std::exp(-std::abs(x)));
      out_data[idx] = term1 - term2 + term3 * pos_weight_idx;

This can be written as:

$$ res = x - x*label + \ln(1+e^{-x})*posWeight $$

Taking the partial derivative with respect to x gives:

$$ \begin{aligned} \frac{\partial res}{\partial x} &= 1-label + \frac{-e^{-x} * posWeight}{1+e^{-x}} \\ &= \frac{1+e^{-x}}{1+e^{-x}}-label + \frac{-e^{-x} * posWeight}{1+e^{-x}} \\ &= \frac{1+e^{-x}-e^{-x} * posWeight}{1+e^{-x}}-label \\ &= \frac{1+(1-posWeight)*e^{-x}}{1+e^{-x}}-label \end{aligned} $$

However, the backward computation is implemented as follows:

# paddle/phi/kernels/cpu/sigmoid_cross_entropy_with_logits_grad_kernel.cc L50-L52

      T simoid_x = static_cast<T>(1) / (static_cast<T>(1) + std::exp(-x));
      T diff = simoid_x * pos_weight_idx - label;
      dx_data[idx] = dout * diff;

which corresponds to the formula:

$$ \frac{\partial res}{\partial x} = \frac{posWeight}{1+e^{-x}} - label $$

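That is why the results differ whenever posWeight is not an all-ones Tensor. I am not sure my analysis is correct, so I would appreciate it if someone could take a look.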
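To make the gap between the two formulas concrete, here is a minimal per-element sketch in plain Python. It assumes x > 0 (the repro samples its logits from [0, 1)), so the simplified forward formula above applies. The function names `grad_derived` and `grad_kernel` are mine for illustration and are not Paddle APIs. Under these assumptions the two expressions should differ by exactly 1 - posWeight, which vanishes when posWeight is 1.

```python
import math
import random

def grad_derived(x, label, pos_weight):
    """d/dx of x - x*label + ln(1 + e^{-x}) * pos_weight (simplified forward, x > 0)."""
    return 1.0 - label - pos_weight * math.exp(-x) / (1.0 + math.exp(-x))

def grad_kernel(x, label, pos_weight):
    """Per-element formula from the backward kernel quoted above (with dout = 1)."""
    sigmoid_x = 1.0 / (1.0 + math.exp(-x))
    return sigmoid_x * pos_weight - label

random.seed(2023)
# Each sample is (x, label, pos_weight), all drawn from [0, 1).
samples = [(random.random(), random.random(), random.random()) for _ in range(200)]

# Gap between the two formulas for each sample.
gaps = [grad_derived(x, l, w) - grad_kernel(x, l, w) for x, l, w in samples]
gap_matches_one_minus_w = all(
    math.isclose(g, 1.0 - w, abs_tol=1e-12) for g, (_, _, w) in zip(gaps, samples)
)
# With pos_weight fixed to 1 the two formulas coincide.
agree_when_w_is_one = all(
    math.isclose(grad_derived(x, l, 1.0), grad_kernel(x, l, 1.0), abs_tol=1e-12)
    for x, l, _ in samples
)
print(gap_matches_one_minus_w, agree_when_w_is_one)
```

This is only an algebraic check of the two expressions as written in this comment; it does not call the actual kernel.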


zeroRains · 2024-05-13T08:16:39Z

The analysis above has a flaw: the derivation ignored the gradients of std::abs and of T term1 = (x > 0) ? x : 0; that the forward computation uses. The corrected forward formula is:

$$ res = where(x>0,x,0) - x*label + \ln(1+e^{-|x|})*posWeight $$

The gradient derived from it is:

$$ \frac{\partial res}{\partial x} = where(x>0, 1, 0) - label + \frac{-e^{-|x|} * where(x>=0, 1, -1) * posWeight}{1+e^{-|x|}} $$

Here where(x>0, 1, 0) is the gradient of T term1 = (x > 0) ? x : 0; in the forward computation, and where(x>=0, 1, -1) is the gradient of std::abs.
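The corrected derivation can be sanity-checked per element against a central finite difference of the forward formula. The sketch below is plain Python and does not call Paddle. The helper names, the step size `delta`, and the sample values of label and posWeight are illustrative assumptions. Note that the derivative of ln(1+e^{-|x|}) has 1+e^{-|x|} in its denominator, which is what the code uses.

```python
import math

def forward(x, label, pos_weight):
    """Per-element forward: where(x>0, x, 0) - x*label + ln(1 + e^{-|x|}) * pos_weight."""
    term1 = x if x > 0 else 0.0
    return term1 - x * label + math.log(1.0 + math.exp(-abs(x))) * pos_weight

def grad_corrected(x, label, pos_weight):
    """Gradient following the corrected derivation above."""
    step = 1.0 if x > 0 else 0.0      # gradient of where(x > 0, x, 0)
    sign = 1.0 if x >= 0 else -1.0    # gradient of abs(x)
    e = math.exp(-abs(x))
    return step - label - e * sign * pos_weight / (1.0 + e)

def grad_kernel(x, label, pos_weight):
    """Backward kernel formula quoted in the first comment (with dout = 1)."""
    return pos_weight / (1.0 + math.exp(-x)) - label

def grad_numeric(x, label, pos_weight, delta=1e-6):
    """Central finite difference of the forward function."""
    return (forward(x + delta, label, pos_weight)
            - forward(x - delta, label, pos_weight)) / (2.0 * delta)

label, pos_weight = 0.3, 0.4          # arbitrary illustrative values
for x in (-1.5, -0.2, 0.7, 2.0):      # points away from the kink at x = 0
    print(x,
          grad_numeric(x, label, pos_weight),
          grad_corrected(x, label, pos_weight),
          grad_kernel(x, label, pos_weight))
```

In this sketch the quoted kernel formula matches the corrected gradient for x < 0 and is off by 1 - posWeight for x > 0, which would be consistent with the test logs in this thread where only some elements of the two gradient tensors differ.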

The corresponding fix PR:

fix the bug in sigmoid_cross_entropy_with_logits_grad_kernel #64253

zeroRains · 2024-05-15T06:17:47Z

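The kernel's backward result is checked against gradients computed numerically in numpy (see the source: op_test.py#L148-L323), while the decomposed op obtains its gradient through autodiff and is checked against the kernel's backward result. My inference is that the kernel's backward implementation is wrong. Verification follows: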

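In TestSigmoidCrossEntropyWithLogitsOp4, a unit test for the sigmoid_cross_entropy_with_logits op, the relative error tolerance is max_relative_error=0.005, which is fairly loose. With it, the backward kernel on the current develop branch passes the test (it passes, but the two tensors are visibly different):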

W0515 05:27:49.445072 36810 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 12.4, Runtime API Version: 11.2
W0515 05:27:49.449699 36810 gpu_resources.cc:164] device: 0, cuDNN Version: 8.1.
numeric : 
[array([[ 3.77183495e-04, -2.20797240e-04, -3.14791652e-04, ...,
        -2.02644761e-04,  2.45672779e-04, -6.17090210e-06],
       [ 4.84435251e-04,  2.84202716e-04,  1.83931716e-05, ...,
         6.29521346e-04,  5.20318281e-04,  2.33842612e-05],
       [ 3.12574747e-04, -4.71084098e-04,  1.10182442e-04, ...,
         6.98864401e-04,  2.33956572e-04, -7.56920161e-05],
       ...,
       [ 7.15934865e-04, -3.74937504e-04,  3.26225586e-04, ...,
         3.84216391e-05, -5.20641936e-04, -4.17575856e-04],
       [ 1.96946960e-05,  3.88698082e-04, -2.81023718e-04, ...,
        -5.38852117e-05,  3.67850861e-04, -1.84393860e-04],
       [-9.46350590e-05,  1.44749951e-05, -2.59066396e-04, ...,
         5.43415898e-04,  5.17161748e-05,  5.20940836e-04]])]
analytic_grads : 
[array([[ 1.84299020e-04, -2.20797405e-04, -3.14791844e-04, ...,
        -2.02644764e-04,  1.03724500e-04, -3.28235993e-04],
       [ 1.30288861e-04, -4.92659295e-04, -9.86970736e-05, ...,
         3.65435473e-04,  4.38155145e-04, -7.09606712e-04],
       [-1.62492418e-04, -4.71084140e-04,  1.10182313e-04, ...,
         1.25990168e-04,  1.87285167e-04, -5.22634377e-04],
       ...,
       [-3.11346327e-05, -3.74937645e-04,  3.26225574e-04, ...,
        -3.02651920e-04, -5.20642019e-04, -4.17576055e-04],
       [-3.08952871e-04,  2.82421633e-04, -2.81023766e-04, ...,
        -5.38854093e-05,  6.31943427e-05, -1.84394142e-04],
       [-1.67904546e-04, -1.19940036e-05, -2.59066405e-04, ...,
         3.36552688e-04,  2.25882243e-05, -9.09629301e-05]])]
max_relative_error : 
0.005
.
----------------------------------------------------------------------
Ran 1 test in 2.453s

OK

But when I tighten the tolerance to max_relative_error=0.0005, I get the following result:

I0515 06:12:58.692179 17707 program_interpreter.cc:221] New Executor is Running.
I0515 06:12:58.693336 17707 interpreter_util.cc:652] Standalone Executor is Used.
numeric : 
[array([[ 3.77183495e-04, -2.20797240e-04, -3.14791652e-04, ...,
        -2.02644761e-04,  2.45672779e-04, -6.17090210e-06],
       [ 4.84435251e-04,  2.84202716e-04,  1.83931716e-05, ...,
         6.29521346e-04,  5.20318281e-04,  2.33842612e-05],
       [ 3.12574747e-04, -4.71084098e-04,  1.10182442e-04, ...,
         6.98864401e-04,  2.33956572e-04, -7.56920161e-05],
       ...,
       [ 7.15934865e-04, -3.74937504e-04,  3.26225586e-04, ...,
         3.84216391e-05, -5.20641936e-04, -4.17575856e-04],
       [ 1.96946960e-05,  3.88698082e-04, -2.81023718e-04, ...,
        -5.38852117e-05,  3.67850861e-04, -1.84393860e-04],
       [-9.46350590e-05,  1.44749951e-05, -2.59066396e-04, ...,
         5.43415898e-04,  5.17161748e-05,  5.20940836e-04]])]
analytic_grads : 
[array([[ 1.84299020e-04, -2.20797405e-04, -3.14791844e-04, ...,
        -2.02644764e-04,  1.03724500e-04, -3.28235993e-04],
       [ 1.30288861e-04, -4.92659295e-04, -9.86970736e-05, ...,
         3.65435473e-04,  4.38155145e-04, -7.09606712e-04],
       [-1.62492418e-04, -4.71084140e-04,  1.10182313e-04, ...,
         1.25990168e-04,  1.87285167e-04, -5.22634377e-04],
       ...,
       [-3.11346327e-05, -3.74937645e-04,  3.26225574e-04, ...,
        -3.02651920e-04, -5.20642019e-04, -4.17576055e-04],
       [-3.08952871e-04,  2.82421633e-04, -2.81023766e-04, ...,
        -5.38854093e-05,  6.31943427e-05, -1.84394142e-04],
       [-1.67904546e-04, -1.19940036e-05, -2.59066405e-04, ...,
         3.36552688e-04,  2.25882243e-05, -9.09629301e-05]])]
max_relative_error : 
0.0005
F
======================================================================
FAIL: test_check_grad (__main__.TestSigmoidCrossEntropyWithLogitsOp4)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/paddle/test/deprecated/legacy_test/test_sigmoid_cross_entropy_with_logits_op.py", line 178, in test_check_grad
    self.check_grad(['X'], 'Out', check_pir=True)
  File "/paddle/build/test/legacy_test/op_test.py", line 2986, in check_grad
    self.check_grad_with_place(
  File "/paddle/build/test/legacy_test/op_test.py", line 3298, in check_grad_with_place
    numeric_grads = self.check_grad_with_place_for_static(
  File "/paddle/build/test/legacy_test/op_test.py", line 3089, in check_grad_with_place_for_static
    self._assert_is_close(
  File "/paddle/build/test/legacy_test/op_test.py", line 2942, in _assert_is_close
    self.assertLessEqual(max_diff, max_relative_error, err_msg())
AssertionError: 0.0007811970012982192 not less than or equal to 0.0005 : Operator sigmoid_cross_entropy_with_logits error, Gradient Check On Place(cpu) variable X (shape: (64, 20), dtype: float64) max gradient diff 7.811970e-04 over limit 5.000000e-04, the first error element is 3, expected 5.481218e-04, but got 2.099690e-05.

----------------------------------------------------------------------
Ran 1 test in 0.521s

FAILED (failures=1)

So the loose tolerance is what kept the incorrect backward computation from being exposed.
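One detail in the failure message is worth spelling out. For the reported element (expected 5.481218e-04, got 2.099690e-05) the true relative error is about 0.96, yet the reported max diff is only 7.8e-04. One reading consistent with those numbers is that differences are not divided by very small reference values, so the check behaves like an absolute tolerance for gradients of this magnitude. The sketch below illustrates that reading only. `max_scaled_diff` and its `floor` value are hypothetical and are not taken from op_test.py.

```python
def max_scaled_diff(expected, actual, floor=1e-3):
    """Largest |expected - actual| / scale, where scale is |expected| unless it is
    below `floor`, in which case scale is 1 (hypothetical rule; `floor` is assumed)."""
    worst = 0.0
    for e, a in zip(expected, actual):
        scale = abs(e) if abs(e) >= floor else 1.0
        worst = max(worst, abs(e - a) / scale)
    return worst

# The single element quoted in the failure message above.
expected = [5.481218e-04]
actual = [2.099690e-05]

diff = max_scaled_diff(expected, actual)
true_relative = abs(expected[0] - actual[0]) / abs(expected[0])

print(diff, true_relative)
print(diff <= 0.005, diff <= 0.0005)
```

Under this assumed rule the element passes at 0.005 and fails at 0.0005, matching the two runs above, even though its value is wrong by almost 100% in relative terms.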

With the fix PR applied and max_relative_error=0.0005, the test still passes and the two gradients agree to within tolerance, as shown below:

W0515 06:13:53.214535 18318 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 12.4, Runtime API Version: 11.2
W0515 06:13:53.220155 18318 gpu_resources.cc:164] device: 0, cuDNN Version: 8.1.
numeric : 
[array([[ 3.77183495e-04, -2.20797240e-04, -3.14791652e-04, ...,
        -2.02644761e-04,  2.45672779e-04, -6.17090210e-06],
       [ 4.84435251e-04,  2.84202716e-04,  1.83931716e-05, ...,
         6.29521346e-04,  5.20318281e-04,  2.33842612e-05],
       [ 3.12574747e-04, -4.71084098e-04,  1.10182442e-04, ...,
         6.98864401e-04,  2.33956572e-04, -7.56920161e-05],
       ...,
       [ 7.15934865e-04, -3.74937504e-04,  3.26225586e-04, ...,
         3.84216391e-05, -5.20641936e-04, -4.17575856e-04],
       [ 1.96946960e-05,  3.88698082e-04, -2.81023718e-04, ...,
        -5.38852117e-05,  3.67850861e-04, -1.84393860e-04],
       [-9.46350590e-05,  1.44749951e-05, -2.59066396e-04, ...,
         5.43415898e-04,  5.17161748e-05,  5.20940836e-04]])]
analytic_grads : 
[array([[ 3.77183699e-04, -2.20797405e-04, -3.14791844e-04, ...,
        -2.02644764e-04,  2.45672821e-04, -6.17087178e-06],
       [ 4.84435362e-04,  2.84202718e-04,  1.83934196e-05, ...,
         6.29521507e-04,  5.20318417e-04,  2.33842613e-05],
       [ 3.12574821e-04, -4.71084140e-04,  1.10182313e-04, ...,
         6.98864469e-04,  2.33956823e-04, -7.56920088e-05],
       ...,
       [ 7.15934877e-04, -3.74937645e-04,  3.26225574e-04, ...,
         3.84217363e-05, -5.20642019e-04, -4.17576055e-04],
       [ 1.96948667e-05,  3.88698345e-04, -2.81023766e-04, ...,
        -5.38854093e-05,  3.67850951e-04, -1.84394142e-04],
       [-9.46347782e-05,  1.44751433e-05, -2.59066405e-04, ...,
         5.43416127e-04,  5.17164533e-05,  5.20940836e-04]])]
max_relative_error : 
0.0005
.
----------------------------------------------------------------------
Ran 1 test in 2.753s

OK

zeroRains · 2024-05-16T11:00:43Z

The bug has been fixed; see the PR for details:

fix the bug in sigmoid_cross_entropy_with_logits_grad_kernel #64253

zeroRains added status/new-issue 新建 type/bug-report 报bug labels May 11, 2024

paddle-bot bot assigned cuicheng01 May 11, 2024

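zeroRains changed the title ~~sigmoid_cross_entropy_with_logits: the decomposed computation and the kernel call give inconsistent direction computation results~~ sigmoid_cross_entropy_with_logits: the decomposed computation and the kernel call give inconsistent backward computation results May 11, 2024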

paddle-bot bot added the PFCC Paddle Framework Contributor Club，https://github.com/PaddlePaddle/community/tree/master/pfcc label May 11, 2024

zeroRains mentioned this issue May 13, 2024

fix the bug in sigmoid_cross_entropy_with_logits_grad_kernel #64253

Merged

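cyber-pioneer changed the title ~~sigmoid_cross_entropy_with_logits: the decomposed computation and the kernel call give inconsistent backward computation results~~ [Bug Report] sigmoid_cross_entropy_with_logits: autodiff over the decomposed primitive ops disagrees with the backward kernel May 16, 2024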

zeroRains closed this as completed May 16, 2024

paddle-bot bot added status/close 已关闭 and removed status/new-issue 新建 labels May 16, 2024
