
Custom loss function example using kad_op functions #52

Open
yttuncel opened this issue May 8, 2023 · 6 comments

@yttuncel commented May 8, 2023

Hi,

I'm trying to implement a custom loss function with a simple MLP.
Is there an example of using the kad_op functions to accomplish this so that I benefit from automatic differentiation?
I don't want to write the backward computation explicitly, as is done for the currently implemented loss functions (mse, ce, etc.).

Or is this approach not feasible (for memory-consumption reasons), since it would require computing and storing gradients for every operation in the loss function?

I'd greatly appreciate any help/feedback/example!

Thanks!

@attractivechaos (Owner) commented

Does your cost function take a new form or simply combine mse, ce, etc? If the former, you will need a new operator and have to implement backward propagation by yourself. If the latter, you can chain kad_op functions. You may have a look at the implementation of kann_layer_cost() to get an idea:

https://github.com/attractivechaos/kann/blob/f71236a82af2187820fabd9b1aba3138b8a4de04/kann.c#L758C13-L778

You need to provide a truth node and label the output and cost nodes with KANN_F_OUT and KANN_F_COST, respectively.
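
For a concrete picture, here is a minimal, untested sketch of that chaining for an MSE-style cost, loosely following the kann_layer_cost() code linked above (kann_layer_dense() and kad_mse() are assumed to have their usual signatures):

kad_node_t *my_cost(kad_node_t *t, int n_out)
{
    kad_node_t *truth, *cost;
    t = kann_layer_dense(t, n_out);  /* final dense layer */
    truth = kad_feed(2, 1, n_out);   /* truth values fed in at training time */
    truth->ext_flag |= KANN_F_TRUTH;
    cost = kad_mse(t, truth);        /* any chain of kad_* ops can go here */
    t->ext_flag |= KANN_F_OUT;       /* mark the network output */
    cost->ext_flag |= KANN_F_COST;   /* mark the scalar cost */
    return cost;                     /* pass this node to kann_new() */
}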

@yttuncel (Author) commented May 9, 2023

Thanks for answering.

I'm trying to implement Hinton's Forward-forward algorithm.
Basically, I have a binary classification problem: $\hat y \in \{-1,1\}$. My loss function is something like the following:
$L = \log(1+\exp(\hat y\,(\sum h^2 - c)))$, where $c$ is a hyperparameter and $h$ is the vector of output activations. The summation runs over the activations (e.g., if the layer has 8 neurons, $h$ is a 1×8 vector, so $\sum h^2$ is a scalar and hence the loss is a scalar).

I tried to implement this as a layer in the graph as follows:

x_in = kad_feed(2, 1, 128), x_in->ext_flag |= KANN_F_IN;  // input: 1x128
t = kann_layer_layernorm(x_in), t->ext_label = 11;
t = kad_add(kad_cmul(x_in, w), b), t->ext_label = 12;     // w, b: weight/bias nodes created elsewhere
t = kad_relu(t), t->ext_label = 13;                       // Activations
t = kad_reduce_sum(kad_square(t), 1);                     // h.pow(2).sum(1) -->> GOODNESS
t = kad_log(kad_add(kad_exp(kad_sub(t, kann_new_leaf(KAD_CONST, c, 0))), kann_new_leaf(KAD_CONST, 1.0, 0))), t->ext_label = 18; // log(1+exp(h.pow(2).sum(1) - c)) -->> POSITIVE LOSS

I have two questions:
1- Right now this is missing the label information, i.e., it only covers half of the cost function. I'm not sure how to integrate $\hat y$: how should the necessary kad_mul take in y_hat?
2- Even if I did that, it adds many nodes to the graph, which increases the memory footprint. So although this approach handles the derivative computation of the loss automatically, writing a dedicated operator seems more logical from an efficiency point of view. Is there a way to use the kad_op functions inside the new operator I define? I guess not, since they expect kad_node_t as input.

Thus, my attempt at writing the operator:

int kad_op_ffloss(kad_node_t *p, int action)
{
	int i, n;
	float c = 4.0f;
	kad_node_t *y1 = p->child[0]; /* prediction (the activations) */
	kad_node_t *y0 = p->child[1]; /* truth */ // I need to bind y0 as the truth label in the graph
	float label;
	n = kad_len(y0);
	label = y0->x[0]; // Check what reaches here: one sample or a batch of samples?
	if (action == KAD_SYNC_DIM) {
		if (n != kad_len(y1)) return -1;
		p->n_d = 0;
	} else if (action == KAD_FORWARD) {
		float cost, goodness = 0.0f;
		for (i = 0; i < n; ++i)
			goodness += y1->x[i] * y1->x[i];
		cost = logf(1.0f + expf(label * (goodness - c))); // POSITIVE: label=1, NEGATIVE: label=-1
		p->x[0] = cost;
	} else if (action == KAD_BACKWARD && kad_is_back(y1)) {
		for (i = 0; i < n; ++i)
			// WORKING ON THIS PART NOW....
			y1->g[i] += label > 0 ? (p->g[i] * (1.0f / (1.0f + expf(p->x[i] - c)))) : (p->g[i] * (1.0f / (1.0f + expf(-p->x[i] + c))));
	}
	return 0;
}

I'm now trying to figure out the derivative of the loss function for the backward computation.
How can I define the derivative of:

for (i = 0; i < n; ++i)
    goodness += y1->x[i]*y1->x[i];

?
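
(For the record, my current understanding of the chain rule here, with $u = \hat y(\sum_i h_i^2 - c)$: $\partial L/\partial h_i = \sigma(u)\cdot\hat y\cdot 2h_i$, where $\sigma(u) = 1/(1+e^{-u})$. If that is right, the backward branch would presumably look something like the sketch below, where goodness, label and c are the same quantities as in the forward pass (goodness would have to be recomputed or cached in the backward branch), and p->g[0] is used because p is a scalar node.)

float s = 1.0f / (1.0f + expf(-label * (goodness - c))); /* sigmoid of label*(goodness - c) */
for (i = 0; i < n; ++i)
	y1->g[i] += p->g[0] * label * s * 2.0f * y1->x[i]; /* d(sum h^2)/dh_i = 2*h_i */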

@attractivechaos (Owner) commented

In Hinton's preprint, it seems he is using the logistic function. Why did you choose log(1+exp)? In theory it could go to infinity.

@yttuncel (Author) commented May 10, 2023

Yes, to be frank, I've seen log(1+exp) in an implementation I found on GitHub. I compared it with 1/(1+exp), and log(1+exp) works better for my dataset. I have a working Python implementation of both.
I think the people who implemented it preferred log(1+exp) because its derivative is nice? Just my guess.
Also, with all the noise in the ML community, I think of a hundred different functions when I see the terms logistic function, logistic loss, sigmoid, log loss, etc.

Anyway, in theory, yes, it can go to infinity, but in practice the negative samples limit that behavior.

I've made some progress with the loss today. I can compute the forward loss correctly (it matches the reference Python implementation). Now I'm working on the backward pass. I'm still stuck on the $\partial\,goodness/\partial h$ term ($goodness=\sum h^2$). I think I can reuse your backward implementation in kad_op_reduce_sum for that, right?

@attractivechaos (Owner) commented

If I understand correctly, $\sum h^2-c$ is the output. If so, you may build a model with the code at the end. You need to manually label the output, truth and cost nodes because you are not using the kann_layer_cost() function. I haven't tested the code, so I'm not sure if it works.

If you want to implement it on your own, you may consider implementing an operator for $\log(1+\exp(\hat{y}\cdot y))$ and letting kann handle the backward of $\sum h^2$; see the sketch after the code below.

kann_t *model_gen(float c_val)
{
    kad_node_t *x_in, *w, *b, *h2, *c, *y, *t, *cost; // w, b: weight/bias nodes, assumed to be created elsewhere (not shown)
    x_in = kad_feed(2, 1, 128), x_in->ext_flag |= KANN_F_IN;
    t = kann_layer_layernorm(x_in), t->ext_label = 11;
    t = kad_add(kad_cmul(x_in, w), b), t->ext_label = 12;
    t = kad_relu(t), t->ext_label = 13; // activations h
    h2 = kad_reduce_sum(kad_square(t), 1); // \sum h^2
    c = kann_new_scalar(KAD_CONST, c_val), c->ext_label = 31; // a new constant
    t = kad_sub(h2, c), t->ext_flag |= KANN_F_OUT; // \sum h^2 - c
    y = kad_feed(2, 1, 1), y->ext_flag |= KANN_F_TRUTH; // truth
    t = kad_exp(kad_cmul(y, t)); // exp(y * (\sum h^2 - c))
    cost = kad_log(kad_add(kann_new_scalar(KAD_CONST, 1.0f), t)); // the entire loss
    cost->ext_flag |= KANN_F_COST;
    return kann_new(cost, 0);
}
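
And here is a rough, untested sketch of the dedicated operator mentioned above, using only the kad_node_t fields that already appear in this thread (hooking the new operator into kautodiff's operator table is not shown). It assumes both children are scalar nodes: child[0] is $y=\sum h^2-c$ and child[1] is the truth $\hat y\in\{-1,1\}$.

int kad_op_log1pexp(kad_node_t *p, int action) /* log(1 + exp(yhat * y)); name is illustrative */
{
    kad_node_t *y = p->child[0], *yhat = p->child[1];
    if (action == KAD_SYNC_DIM) {
        if (kad_len(y) != 1 || kad_len(yhat) != 1) return -1;
        p->n_d = 0; /* scalar output */
    } else if (action == KAD_FORWARD) {
        float u = yhat->x[0] * y->x[0];
        p->x[0] = logf(1.0f + expf(u)); /* expf may overflow for large u; a numerically stable form may be preferable */
    } else if (action == KAD_BACKWARD && kad_is_back(y)) {
        float u = yhat->x[0] * y->x[0];
        float s = 1.0f / (1.0f + expf(-u)); /* d/du log(1+e^u) = sigmoid(u) */
        y->g[0] += p->g[0] * s * yhat->x[0];
    }
    return 0;
}

kann would then handle the backward of $\sum h^2$ through kad_square()/kad_reduce_sum() automatically.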

@yttuncel (Author) commented

Hi again,

I have a working implementation of BP and FF for my dataset.
Now I'm trying to estimate the total memory allocated for training.

For example, I have a 3-layer MLP with 100 neurons in each layer and an input vector of length 120. I compile the application in debug mode and iterate over the nodes with the kad_print_graph function; at the end of each line, I print malloc_usable_size(p->x) and malloc_usable_size(p->g) to get the size of the memory allocated for those pointers. I get something like this:

0       0:1     0       .       [1,120] feed    x: 0 B  g: 0 B
1       1:0     0       .       [100,120]       var     x: 0 B  g: 135152 B
2       1:0     0       .       [1,100] cmul($0,$1)     x: 408 B        g: 408 B
3       1:0     0       .       [100]   var     x: 0 B  g: 0 B
4       1:0     0       .       [1,100] add($2,$3)      x: 408 B        g: 408 B
5       1:0     0       .       [1,100] relu($4)        x: 408 B        g: 408 B
6       1:0     0       .       [100,100]       var     x: 0 B  g: 0 B
7       1:0     0       .       [1,100] cmul($5,$6)     x: 408 B        g: 408 B
8       1:0     0       .       [100]   var     x: 0 B  g: 0 B
9       1:0     0       .       [1,100] add($7,$8)      x: 408 B        g: 408 B
10      1:0     0       .       [1,100] relu($9)        x: 408 B        g: 408 B
11      1:0     0       .       [100,100]       var     x: 0 B  g: 0 B
12      1:0     0       .       [1,100] cmul($10,$11)   x: 408 B        g: 408 B
13      1:0     0       .       [100]   var     x: 0 B  g: 0 B
14      1:0     0       .       [1,100] add($12,$13)    x: 408 B        g: 408 B
15      1:0     0       .       [1,100] relu($14)       x: 408 B        g: 408 B
16      1:0     0       .       [8,100] var     x: 0 B  g: 0 B
17      1:0     0       .       [1,8]   cmul($15,$16)   x: 40 B g: 40 B
18      1:0     0       .       [8]     var     x: 0 B  g: 0 B
19      1:0     0       .       [1,8]   add($17,$18)    x: 40 B g: 40 B
20      1:32    0       .       [1,8]   softmax($19)    x: 40 B g: 40 B
21      0:4     0       .       [1,8]   feed    x: 0 B  g: 0 B
22      1:8     0       .       []      ce_multi($20,$21)       x: 24 B g: 24 B

Here I see two oddities:
1- Line 1: x shows 0 B because it was freed in kann_new. Does this mean the x's for all layers are collated in net->x? What about the rest of the layers, which have non-zero x's?
2- Lines 6 and 16: g is 0 B for these. Where are the gradients for these layers stored?

In short, I'd like to estimate the memory footprint of training this network. Am I on the right track?
I feel so close to the end; I'd appreciate even the tiniest bit of help to point me in the right direction!
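
For reference, this is roughly the kind of loop I use to sum things up (just a sketch; it assumes kann_t exposes the node array v and its length n as in kann.h, and malloc_usable_size() from glibc's <malloc.h>):

#include <stdio.h>
#include <malloc.h> /* malloc_usable_size(), glibc-specific */
#include "kann.h"

/* rough per-node accounting of value (x) and gradient (g) buffers */
static void dump_node_mem(const kann_t *ann)
{
	size_t tot = 0;
	int i;
	for (i = 0; i < ann->n; ++i) {
		kad_node_t *p = ann->v[i];
		/* note: malloc_usable_size() is only well-defined for pointers returned directly by malloc;
		 * for pointers into a larger collated buffer its result is not meaningful */
		size_t sx = p->x ? malloc_usable_size(p->x) : 0;
		size_t sg = p->g ? malloc_usable_size(p->g) : 0;
		printf("%d\t%d floats\tx: %zu B\tg: %zu B\n", i, kad_len(p), sx, sg);
		tot += sx + sg;
	}
	printf("total: %zu B\n", tot);
}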
