Bug: successor.local_gradient_for_argument(self) could be corrupted #52

Banyc opened this issue Jul 3, 2022 · 4 comments

Banyc commented Jul 3, 2022

Issue

local_gradient_for_argument might read a corrupted float value from the successor.

Say we have the computation graph:

    h
    |
    |
    f
  • $h$ has parameters $w$
  • $f$ has parameters $v$
  • $h : \mathbb{R}^m \to \mathbb{R}$
  • $f : \mathbb{R}^n \to \mathbb{R}$
  • $h$ is the successor of $f$
  • $E$ is the error produced at the output of the whole graph
  • $\frac{\partial E}{\partial w} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial w}$
  • $\frac{\partial E}{\partial f} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial f}$
  • $\frac{\partial E}{\partial v} = \frac{\partial E}{\partial f} \cdot \frac{\partial f}{\partial v}$
  • when doing backpropagation, the steps will be
    1. $h$ computes $\frac{\partial E}{\partial w}$ and caches $\frac{\partial E}{\partial h}$ and $\frac{\partial h}{\partial w}$
    2. $h$ updates $w$ to $w'$
    3. $f$ computes $\frac{\partial E}{\partial f}$, during which $\frac{\partial h}{\partial f}$ gets cached
      1. $\frac{\partial h}{\partial f}$ is not yet in the cache, so $h$ has to compute it now
      2. $\frac{\partial h}{\partial f}$ is computed based on the new parameter $w'$
        • This is the problem!
        • $\frac{\partial h}{\partial f}$ is corrupted
      3. $\frac{\partial h}{\partial f}$ is in the cache now
      4. $\frac{\partial E}{\partial f}$ is computed by looking up both $\frac{\partial E}{\partial h}$ and $\frac{\partial h}{\partial f}$ in the cache
      5. $\frac{\partial E}{\partial f}$ is in the cache now
    4. $f$ computes $\frac{\partial f}{\partial v}$ and caches it
    5. $f$ computes $\frac{\partial E}{\partial v}$ with $\frac{\partial f}{\partial v}$ and the corrupted $\frac{\partial h}{\partial f}$
    6. $f$ updates $v$ based on the corrupted $\frac{\partial E}{\partial v}$ (a toy numeric sketch of this ordering follows the list)
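
To make the ordering issue concrete, here is a toy numeric sketch (my own example, not the book's code), taking the error to be $h$'s output directly so that $\frac{\partial E}{\partial h} = 1$, with $h(u) = w u$ and $f(x) = v x$, hence $\frac{\partial h}{\partial f} = w$ and $\frac{\partial E}{\partial w} = v x$:

```python
# Toy illustration of the suspected ordering problem; all names here are made up.
w, v, x, step = 2.0, 3.0, 1.0, 0.1

# Intended ordering: read dh/df = w before w is updated.
dh_df_before = w                    # = 2.0
w_updated = w - step * (v * x)      # h's parameter step: dE/dw = v*x, so w' = 1.7
grad_v_correct = dh_df_before * x   # dE/dv using the old w  -> 2.0

# Suspected ordering: w is updated first, then dh/df is computed lazily.
grad_v_stale = w_updated * x        # dE/dv using the new w' -> 1.7 (the "corrupted" value)

print(grad_v_correct, grad_v_stale)  # 2.0 vs 1.7: the two candidate updates to v differ
```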

Solutions

I can come up with two solutions:

  • compute local_gradient ($\frac{\partial h}{\partial f}$) at the beginning of do_gradient_descent_step, before the parameters ($w$) are modified (see the sketch after this list)
  • the successor $h$ distributes local_gradient ($\frac{\partial h}{\partial f}$) and global_gradient ($\frac{\partial E}{\partial h}$) to $f$ before the parameters ($w$) are modified
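
A minimal sketch of what the first solution might look like, reusing the do_gradient_descent_step body that appears in the patch later in this thread; the early read of self.local_gradient is my addition, and whether it alone is sufficient depends on how successors read the cache:

```python
def do_gradient_descent_step(self, step_size):
    '''Sketch of solution 1 (not the book's code as-is): touch the cached
    local_gradient property first, so this node's derivative with respect to
    its arguments is computed and cached before its own parameters change.'''
    _ = self.local_gradient  # forces compute_local_gradient() with the old parameters
    if self.has_parameters:
        for i, gradient_entry in enumerate(self.global_parameter_gradient):
            # step away from the gradient (unchanged from the original)
            self.parameters[i] -= step_size * gradient_entry
```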

PS

Thank you for your book and the sample code; they helped me understand neural networks more deeply!

Banyc commented Jul 3, 2022

The math notation is rendering weirdly, so I'm posting the raw markdown here:

# Issue

[local_gradient_for_argument](https://github.com/pim-book/programmers-introduction-to-mathematics/blob/b8867dd443f2e4350e8b257d22bdc95de2c008d5/neural_network/neural_network.py#L125) might read a corrupted float value from the successor.

Say we have the computation graph:

```text
    h
    |
    |
    f
```

- $h$ has parameters $w$
- $f$ has parameters $v$
- $h : \mathbb{R}^m \to \mathbb{R}$
- $f : \mathbb{R}^n \to \mathbb{R}$
- $h$ is the successor of $f$
- $E$ is the error produced at the output of the whole graph
- $\frac{\partial E}{\partial w} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial w}$
- $\frac{\partial E}{\partial f} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial f}$
- $\frac{\partial E}{\partial v} = \frac{\partial E}{\partial f} \cdot \frac{\partial f}{\partial v}$
- when doing backpropagation, the steps will be
  1. $h$ computes $\frac{\partial E}{\partial w}$ and caches $\frac{\partial E}{\partial h}$ and $\frac{\partial h}{\partial w}$
  1. $h$ updates $w$ to $w'$
  1. $f$ computes $\frac{\partial E}{\partial f}$, during which $\frac{\partial h}{\partial f}$ gets cached
     1. $\frac{\partial h}{\partial f}$ is not yet in the cache, so $h$ has to compute it now
     1. $\frac{\partial h}{\partial f}$ is computed based on the new parameter $w'$
        - **This is the problem!**
        - $\frac{\partial h}{\partial f}$ is **corrupted**
     1. $\frac{\partial h}{\partial f}$ is in the cache now
     1. $\frac{\partial E}{\partial f}$ is computed by looking up both $\frac{\partial E}{\partial h}$ and $\frac{\partial h}{\partial f}$ in the cache
     1. $\frac{\partial E}{\partial f}$ is in the cache now
  1. $f$ computes $\frac{\partial f}{\partial v}$ and caches it
  1. $f$ computes $\frac{\partial E}{\partial v}$ with $\frac{\partial f}{\partial v}$ and the **corrupted** $\frac{\partial h}{\partial f}$
  1. $f$ updates $v$ based on the **corrupted** $\frac{\partial E}{\partial v}$

# Solutions

I can come up with two solutions:

- compute `local_gradient` ($\frac{\partial h}{\partial f}$) at the beginning of [do_gradient_descent_step](https://github.com/pim-book/programmers-introduction-to-mathematics/blob/b8867dd443f2e4350e8b257d22bdc95de2c008d5/neural_network/neural_network.py#L107) before the parameters ($w$) are modified
- the successor $h$ distributes `local_gradient` ($\frac{\partial h}{\partial f}$) and `global_gradient` ($\frac{\partial E}{\partial h}$) to $f$ before the parameters ($w$) are modified
  - I have also coded a neural network based on yours; the distribution happens in [distribute_global_gradient_entries_to_operands](https://github.com/Banyc/neural_network/blob/c6970d2712c8c01286b0a10434723a073fb99ad7/src/nodes/node.rs#L116)

j2kun commented Jul 26, 2022

I have been slow to respond to this comment because it is quite detailed and I want to make sure I give it the time it deserves.

For starters, one cannot compose f and h as written, because the output of f is a single number while the input of h is a vector of length m. But I think the rest of your argument is unchanged if h and f are single-input functions, so that can be ignored.

Next, I believe the structure of the code ensures that all gradients are computed before any update steps occur.

I believe you can see this by applying the following patch (attached as a file as well), and running any of the integration tests in neural_network_integration_test.py, e.g., with pytest -s -k test_learn_xor_sigmoid neural_network_integration_test.py
corrupted_patch.txt

diff --git a/neural_network/neural_network.py b/neural_network/neural_network.py
index cee6a4a..db639e1 100644
--- a/neural_network/neural_network.py
+++ b/neural_network/neural_network.py
@@ -107,9 +107,11 @@ class Node:
     def do_gradient_descent_step(self, step_size):
         '''The core gradient step subroutine: compute the gradient for each of this node's
         tunable parameters, step away from the gradient.'''
+        print("Starting one gradient descent step")
         if self.has_parameters:
             for i, gradient_entry in enumerate(self.global_parameter_gradient):
                 # step away from the gradient
+                print("Doing gradient update for parameter %s" % i)
                 self.parameters[i] -= step_size * gradient_entry

     '''Gradient computations which don't depend on the node's definition.'''
@@ -147,6 +149,7 @@ class Node:
     @property
     def local_gradient(self):
         if self.cache.local_gradient is None:
+            print("Computing local gradient for %s" % (self, ))
             self.cache.local_gradient = self.compute_local_gradient()
         return self.cache.local_gradient

@@ -159,6 +162,7 @@ class Node:
     @property
     def local_parameter_gradient(self):
         if self.cache.local_parameter_gradient is None:
+            print("Computing local parameter gradient for %s" % (self, ))
             self.cache.local_parameter_gradient = self.compute_local_parameter_gradient()
         return self.cache.local_parameter_gradient

@@ -391,6 +395,7 @@ class NeuralNetwork:
         return self.error_node.compute_error(inputs, label)

     def backpropagation_step(self, inputs, label, step_size=None):
+        print("Doing one backpropagation step")
         self.compute_error(inputs, label)
         self.for_each(lambda node: node.do_gradient_descent_step(step_size))

The test output should show that all gradient computations occur before all parameter updates. E.g., here is what I see in one gradient descent step of the xor example:

Doing one backpropagation step
Starting one gradient descent step
Starting one gradient descent step
Starting one gradient descent step
Computing local gradient for <neural_network.L2ErrorNode object at 0x7fb0b9e35748>
Computing local gradient for <neural_network.SigmoidNode object at 0x7fb0b9e356d8>
Computing local parameter gradient for <neural_network.LinearNode object at 0x7fb0b9e355c0>
Doing gradient update for parameter 0
Doing gradient update for parameter 1
Doing gradient update for parameter 2
Doing gradient update for parameter 3

The mnist example looks similar to me.

It's possible this issue is still occurring, and I'm too dense to see it. Perhaps you could try to formulate your issue as a failing test case? Or else, just an example network in the code with print statements that demonstrate the issue?

Banyc commented Jul 26, 2022

Banyc commented Jul 26, 2022

Also, thanks for the instructions on pytest! This is the first time I have used that tool, and the guide helped me a lot! Your print method is also great, but I think a test case plus pytest is enough to reproduce the bug, so I didn't print out the checkpoints.
