Custom loss defined as a class instance vs function #19601

Open
mthiboust opened this issue Apr 23, 2024 · 11 comments
Labels: keras-team-review-pending (Pending review by a Keras team member), type:Bug
@mthiboust

While migrating my Keras 2 custom loss to Keras 3, I noticed a weird behavior: the class-defined loss crashes my Jupyter kernel, while the function-defined loss works fine. What am I doing wrong when subclassing keras.losses.Loss?

This is not working:

from keras import losses, ops

class QuantileLoss(losses.Loss):
    def __init__(
        self,
        name: str = "quantile",
        quantile: float = 0.5,
        reduction="sum_over_batch_size",
    ) -> None:
        super().__init__(name=name, reduction=reduction)
        self.quantile = quantile

    def call(self, y_true, y_pred):
        error = y_pred - y_true
        loss = ops.maximum((self.quantile * error), (self.quantile - 1) * error)
        return ops.mean(loss)

model.compile(loss=QuantileLoss(quantile=0.5))

This is working:

from keras import ops

def quantile_loss_fn(quantile):
    def fn(y_true, y_pred):
        error = y_pred - y_true
        loss = ops.maximum((quantile * error), (quantile - 1) * error)
        return ops.mean(loss)

    return fn

model.compile(loss=quantile_loss_fn(0.5))
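As a side note (not necessarily related to the crash): as far as I can tell, in Keras 3 the Loss base class applies the configured reduction itself, so call() is conventionally written to return per-sample values rather than a pre-reduced scalar. A minimal sketch of that convention, using a hypothetical QuantileLossPerSample name and a get_config() override for serialization:

from keras import losses, ops

class QuantileLossPerSample(losses.Loss):
    """Quantile loss whose call() returns per-sample values; the base
    class applies the "sum_over_batch_size" reduction (i.e. the mean)."""

    def __init__(self, quantile: float = 0.5, name: str = "quantile",
                 reduction="sum_over_batch_size"):
        super().__init__(name=name, reduction=reduction)
        self.quantile = quantile

    def call(self, y_true, y_pred):
        error = y_pred - y_true
        # No final ops.mean(): the reduction is handled by Loss.__call__.
        return ops.maximum(self.quantile * error, (self.quantile - 1) * error)

    def get_config(self):
        # Lets the loss round-trip through model saving/loading.
        config = super().get_config()
        config.update({"quantile": self.quantile})
        return config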

Thanks in advance

mthiboust changed the title from "Custom loss defined as a class vs function" to "Custom loss defined as a class instance vs function" on Apr 23, 2024
mthiboust (Author) commented Apr 23, 2024

I can make it work with a basic class instance without subclassing keras.losses.Loss:

from keras import ops

class QuantileLoss:
    def __init__(self, quantile: float = 0.5):
        self.quantile = quantile

    def __call__(self, y_true, y_pred):
        error = y_pred - y_true
        loss = ops.maximum((self.quantile * error), (self.quantile - 1) * error)
        return ops.mean(loss)

model.compile(loss=QuantileLoss(quantile=0.5))

Is this the way to go with Keras 3?

fchollet (Member) commented

The code looks fine. What is the error you encounter?

mthiboust (Author) commented Apr 23, 2024

My code runs on a JupyterLab server (using the latest official Docker images jupyter/tensorflow-notebook and jupyter/pytorch-notebook from jupyter/docker-stacks), and I connect to it via the vscode-jupyter extension.

The crash is caused by the model.fit() call. It happens within a few seconds when using the torch backend, and a bit later with the tensorflow backend (after a few epochs). But there is no explicit error message I can share with you.

According to this link, the root cause could be a buggy installation of TensorFlow/PyTorch due to mixing pip and conda packages (the official Jupyter image installs TensorFlow via pip while the other packages are installed via mamba/conda).

mthiboust (Author) commented

I reproduced the bug with the latest tensorflow/tensorflow official Docker image using the following code:

Run the official image:

docker run -it --rm tensorflow/tensorflow bash

Install pandas, copy and run the python code:

apt-get update && apt-get install vim
pip install pandas
vim test.py # and then copy and save the code below
python test.py

Python code:

import numpy as np
import pandas as pd

from keras.layers import Dense, Input
from keras.models import Model
from keras.losses import Loss
from keras import ops

class QuantileLoss(Loss):
    def __init__(
        self,
        name: str = "quantile",
        quantile: float = 0.5,
        reduction="sum_over_batch_size",
    ) -> None:
        super().__init__(name=name, reduction=reduction)
        self.quantile = quantile

    def call(self, y_true, y_pred):
        error = y_pred - y_true
        loss = ops.maximum((self.quantile * error), (self.quantile - 1) * error)
        return ops.mean(loss)


X = np.random.random((100000, 100))
y = pd.Series(np.random.random((100000,)))

features = Input(shape=(X.shape[1],))
layers = Dense(200, activation="relu")(features)
labels = Dense(1, activation=None)(layers)

model = Model(features, labels)

model.compile(optimizer="adam", loss=QuantileLoss(quantile=0.5))

model.fit(
    X,
    y.to_numpy(),  # works well when passing just `y` (the pd.Series) instead
    verbose=True,
    epochs=50,
    batch_size=10000,
)

Training time and memory usage are very different depending on the type of the y target:

  1. pd.Series: 8 ms/step during training
  2. np.ndarray: 600 ms/step with high memory usage (which crashes/freezes my laptop)

My code runs on CPU (i7-9750H) / Ubuntu 23.10 / Docker 24.0.5 with Keras 3.0.5 and Tensorflow 2.16.1
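A small diagnostic that might help narrow this down (my own sketch, not from the report above; the explicit float32 copy is only a hypothesis to test, not a confirmed workaround):

import numpy as np
import pandas as pd

y_series = pd.Series(np.random.random((100000,)))
y_array = y_series.to_numpy()

# Compare what fit() actually receives in the two cases.
for label, target in [("pd.Series", y_series), ("np.ndarray", y_array)]:
    arr = np.asarray(target)
    print(label, arr.dtype, arr.shape, arr.flags["C_CONTIGUOUS"])

# Hypothetical thing to try: hand fit() an explicit contiguous float32 copy
# and see whether the slow path follows the dtype/copy or the container type.
y_array32 = np.ascontiguousarray(y_array, dtype="float32")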

sachinprasadhs added the keras-team-review-pending (Pending review by a Keras team member) label and removed the stat:awaiting keras-eng (Awaiting response from Keras engineer) label on May 6, 2024
benz0li commented May 7, 2024

I cannot reproduce with image glcr.b-data.ch/jupyterlab/cuda/python/scipy:3.12.3 (Container: CUDA 12.4.1 + Python 3.12.3).

Cross reference:

Code run on CPU (Intel(R) Xeon(R) Silver 4210R) / GPU (Quadro RTX 4000, Compute Capability 7.5) / Ubuntu 22.04 (Container) with Keras 3.3.3, Numpy 1.26.4 and Tensorflow 2.16.1.

mthiboust (Author) commented

This strange behavior may be CPU-specific. Could you reproduce the bug using only the CPU without CUDA?
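For a CPU-only run with the TensorFlow backend, a minimal sketch (assuming TensorFlow 2.16; the two approaches are alternatives for hiding the GPUs):

import os
# Must be set before TensorFlow initializes the CUDA runtime.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf
# Equivalent TensorFlow-level switch: make no GPU visible to the runtime.
tf.config.set_visible_devices([], "GPU")
print(tf.config.list_logical_devices())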

benz0li commented May 7, 2024

> This strange behavior may be CPU-specific. Could you reproduce the bug using only the CPU without CUDA?

No. I cannot reproduce with image glcr.b-data.ch/jupyterlab/python/scipy:3.12.3 (Container: Python 3.12.3) on Debian 12 (bookworm) using Docker 26.1.0 either:

Cross reference:

Code run on CPU (Intel(R) Xeon(R) Silver 4210R) / Ubuntu 22.04 (Container) with Keras 3.3.3, Numpy 1.26.4 and Tensorflow 2.16.1.

mthiboust (Author) commented

Thanks @benz0li for testing it!

@sachinprasadhs: Now that we know that this issue is not easily reproducible, is there something else I should look at and/or test to better diagnose the issue?

benz0li commented May 7, 2024

> Thanks @benz0li for testing it!

P.S.: On my machine, I cannot reproduce the bug with the latest tensorflow/tensorflow (using CPU) either.

benz0li commented May 7, 2024

> is there something else I should look at and/or test to better diagnose the issue?

Yes: the output of python test.py, i.e. the log files. (Optional: use the latest versions of docker, numpy, keras and pandas.)
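If it helps, a minimal sketch (my assumption, not something requested above, that more verbose logging would be useful) to make the captured output of python test.py more detailed:

import tensorflow as tf

# Turn up TensorFlow's Python-side logger and log the device each op runs on,
# so the redirected output of `python test.py` contains more detail.
tf.get_logger().setLevel("DEBUG")
tf.debugging.set_log_device_placement(True)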

mthiboust (Author) commented

I confirm that my issue happens on CPU with the latest versions: TensorFlow 2.16.1 / Keras 3.3.3 / Numpy 1.26.4 / Pandas 2.2.2. It only happens when using my CPU (it works well on my GPU with the tensorflow/tensorflow:latest-gpu image).

Running test.py with a np.ndarray for the y target:

root@0a0414c2c84b:/# python test.py
2024-05-07 22:02:20.541375: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Epoch 1/50
2024-05-07 22:02:23.140812: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 400000000 exceeds 10% of free system memory.
2024-05-07 22:02:23.300856: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 400000000 exceeds 10% of free system memory.
2024-05-07 22:02:23.451243: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 400000000 exceeds 10% of free system memory.
2024-05-07 22:02:23.637732: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 400000000 exceeds 10% of free system memory.
2024-05-07 22:02:23.816659: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 400000000 exceeds 10% of free system memory.
10/10 ━━━━━━━━━━━━━━━━━━━━ 8s 663ms/step - loss: 0.1569
Epoch 2/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 7s 645ms/step - loss: 0.1397
Epoch 3/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 7s 657ms/step - loss: 0.1344
Epoch 4/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 7s 663ms/step - loss: 0.1313
Epoch 5/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 7s 658ms/step - loss: 0.1293
[...]

Running test.py with a pd.Series for the y target:

root@0a0414c2c84b:/# python test.py
2024-05-07 22:04:24.869910: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Epoch 1/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - loss: 0.1597  
Epoch 2/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - loss: 0.1419 
Epoch 3/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - loss: 0.1360 
Epoch 4/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 0.1326 
Epoch 5/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - loss: 0.1298
[...]

My Docker version is 24.0.5. I haven't tested with the latest version of Docker but could try it next week if necessary.

divyashreepathihalli added the keras-team-review-pending (Pending review by a Keras team member) label and removed the keras-team-review-pending label on May 9, 2024