Custom loss defined as a class instance vs function #19601

Open
mthiboust opened this issue Apr 23, 2024 · 11 comments
Labels: keras-team-review-pending (Pending review by a Keras team member), type:Bug
@mthiboust

While migrating my Keras 2 custom loss to Keras 3, I noticed a weird behavior: the class-defined loss crashes my Jupyter kernel, while the function-defined loss works fine. What am I doing wrong when subclassing keras.losses.Loss?

This is not working:

from keras import losses, ops

class QuantileLoss(losses.Loss):
    def __init__(
        self,
        name: str = "quantile",
        quantile: float = 0.5,
        reduction="sum_over_batch_size",
    ) -> None:
        super().__init__(name=name, reduction=reduction)
        self.quantile = quantile

    def call(self, y_true, y_pred):
        error = y_pred - y_true
        loss = ops.maximum((self.quantile * error), (self.quantile - 1) * error)
        return ops.mean(loss)

model.compile(loss=QuantileLoss(quantile=0.5))

This is working:

from keras import ops

def quantile_loss_fn(quantile):
    def fn(y_true, y_pred):
        error = y_pred - y_true
        loss = ops.maximum((quantile * error), (quantile - 1) * error)
        return ops.mean(loss)

    return fn

model.compile(loss=quantile_loss_fn(0.5))
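As a side note (not necessarily related to the crash): as far as I can tell, in Keras 3 the Loss base class applies the configured reduction itself, so call() is conventionally written to return per-sample values rather than a pre-reduced scalar. A minimal sketch of that convention, using a hypothetical QuantileLossPerSample name and a get_config() override for serialization:

from keras import losses, ops

class QuantileLossPerSample(losses.Loss):
    """Quantile loss whose call() returns per-sample values; the base
    class applies the "sum_over_batch_size" reduction (i.e. the mean)."""

    def __init__(self, quantile: float = 0.5, name: str = "quantile",
                 reduction="sum_over_batch_size"):
        super().__init__(name=name, reduction=reduction)
        self.quantile = quantile

    def call(self, y_true, y_pred):
        error = y_pred - y_true
        # No final ops.mean(): the reduction is handled by Loss.__call__.
        return ops.maximum(self.quantile * error, (self.quantile - 1) * error)

    def get_config(self):
        # Lets the loss round-trip through model saving/loading.
        config = super().get_config()
        config.update({"quantile": self.quantile})
        return config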

Thanks in advance

mthiboust changed the title from "Custom loss defined as a class vs function" to "Custom loss defined as a class instance vs function" on Apr 23, 2024
mthiboust (Author) commented Apr 23, 2024

I can make it work with a basic class instance without subclassing keras.losses.Loss:

from keras import ops

class QuantileLoss:
    def __init__(self, quantile: float = 0.5):
        self.quantile = quantile

    def __call__(self, y_true, y_pred):
        error = y_pred - y_true
        loss = ops.maximum((self.quantile * error), (self.quantile - 1) * error)
        return ops.mean(loss)

model.compile(loss=QuantileLoss(quantile=0.5))

Is this the way to go with Keras 3?

fchollet (Member) commented

The code looks fine. What is the error you encounter?

mthiboust (Author) commented Apr 23, 2024

My code runs on a JupyterLab server (using the latest official Docker images jupyter/tensorflow-notebook and jupyter/pytorch-notebook from jupyter/docker-stacks), and I connect to it via the vscode-jupyter extension.

The crash is caused by the model.fit() call. It happens within a few seconds when using the torch backend, and a bit later with the tensorflow backend (after a few epochs). But there is no explicit error message I can share with you.

According to this link, the root cause could be a buggy installation of TensorFlow/PyTorch due to mixing pip and conda packages (the official Jupyter image installs TensorFlow via pip while the other packages are installed via mamba/conda).

mthiboust (Author) commented

I reproduced the bug with the latest tensorflow/tensorflow official Docker image using the following code:

Run the official image:

docker run -it --rm tensorflow/tensorflow bash

Install pandas, copy and run the python code:

apt-get update && apt-get install vim
pip install pandas
vim test.py # and then copy and save the code below
python test.py

Python code:

import numpy as np
import pandas as pd

from keras.layers import Dense, Input
from keras.models import Model
from keras.losses import Loss
from keras import ops

class QuantileLoss(Loss):
    def __init__(
        self,
        name: str = "quantile",
        quantile: float = 0.5,
        reduction="sum_over_batch_size",
    ) -> None:
        super().__init__(name=name, reduction=reduction)
        self.quantile = quantile

    def call(self, y_true, y_pred):
        error = y_pred - y_true
        loss = ops.maximum((self.quantile * error), (self.quantile - 1) * error)
        return ops.mean(loss)


X = np.random.random((100000, 100))
y = pd.Series(np.random.random((100000,)))

features = Input(shape=(X.shape[1],))
layers = Dense(200, activation="relu")(features)
labels = Dense(1, activation=None)(layers)

model = Model(features, labels)

model.compile(optimizer="adam", loss=QuantileLoss(quantile=0.5))

model.fit(
    X,
    y.to_numpy(),  # works well when passing just `y` (the pd.Series) instead
    verbose=True,
    epochs=50,
    batch_size=10000,
)

Training time and memory usage are very different depending on the type of the y target:

  1. pd.Series: 8 ms/step during training
  2. np.ndarray: 600 ms/step with high memory usage (which crashes/freezes my laptop)

My code runs on CPU (i7-9750H) / Ubuntu 23.10 / Docker 24.0.5 with Keras 3.0.5 and Tensorflow 2.16.1
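A small diagnostic that might help narrow this down (my own sketch, not from the report above; the explicit float32 copy is only a hypothesis to test, not a confirmed workaround):

import numpy as np
import pandas as pd

y_series = pd.Series(np.random.random((100000,)))
y_array = y_series.to_numpy()

# Compare what fit() actually receives in the two cases.
for label, target in [("pd.Series", y_series), ("np.ndarray", y_array)]:
    arr = np.asarray(target)
    print(label, arr.dtype, arr.shape, arr.flags["C_CONTIGUOUS"])

# Hypothetical thing to try: hand fit() an explicit contiguous float32 copy
# and see whether the slow path follows the dtype/copy or the container type.
y_array32 = np.ascontiguousarray(y_array, dtype="float32")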

sachinprasadhs added the keras-team-review-pending (Pending review by a Keras team member) label and removed the stat:awaiting keras-eng (Awaiting response from Keras engineer) label on May 6, 2024
benz0li commented May 7, 2024

I cannot reproduce with image glcr.b-data.ch/jupyterlab/cuda/python/scipy:3.12.3 (Container: CUDA 12.4.1 + Python 3.12.3).

Cross reference:

Code run on CPU (Intel(R) Xeon(R) Silver 4210R) / GPU (Quadro RTX 4000, Compute Capability 7.5) / Ubuntu 22.04 (Container) with Keras 3.3.3, Numpy 1.26.4 and Tensorflow 2.16.1.

mthiboust (Author) commented

This strange behavior may be CPU-specific. Could you reproduce the bug using only the CPU without CUDA?
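For a CPU-only run with the TensorFlow backend, a minimal sketch (assuming TensorFlow 2.16; the two approaches are alternatives for hiding the GPUs):

import os
# Must be set before TensorFlow initializes the CUDA runtime.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf
# Equivalent TensorFlow-level switch: make no GPU visible to the runtime.
tf.config.set_visible_devices([], "GPU")
print(tf.config.list_logical_devices())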

benz0li commented May 7, 2024

> This strange behavior may be CPU-specific. Could you reproduce the bug using only the CPU without CUDA?

No. I cannot reproduce with image glcr.b-data.ch/jupyterlab/python/scipy:3.12.3 (Container: Python 3.12.3) on Debian 12 (bookworm) using Docker 26.1.0 either:

Cross reference:

Code run on CPU (Intel(R) Xeon(R) Silver 4210R) / Ubuntu 22.04 (Container) with Keras 3.3.3, Numpy 1.26.4 and Tensorflow 2.16.1.

mthiboust (Author) commented

Thanks @benz0li for testing it!

@sachinprasadhs: Now that we know that this issue is not easily reproducible, is there something else I should look at and/or test to better diagnose the issue?

benz0li commented May 7, 2024

> Thanks @benz0li for testing it!

P.S.: On my machine, I cannot reproduce the bug with the latest tensorflow/tensorflow (using CPU) either.

benz0li commented May 7, 2024

> is there something else I should look at and/or test to better diagnose the issue?

Yes: the output of python test.py, i.e. the log files. (Optional: use the latest versions of docker, numpy, keras and pandas.)
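If it helps, a minimal sketch (my assumption, not something requested above, that more verbose logging would be useful) to make the captured output of python test.py more detailed:

import tensorflow as tf

# Turn up TensorFlow's Python-side logger and log the device each op runs on,
# so the redirected output of `python test.py` contains more detail.
tf.get_logger().setLevel("DEBUG")
tf.debugging.set_log_device_placement(True)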

mthiboust (Author) commented

I confirm that my issue happens on CPU with the latest versions: TensorFlow 2.16.1 / Keras 3.3.3 / Numpy 1.26.4 / Pandas 2.2.2. It only happens when using my CPU (it works well on my GPU with the tensorflow/tensorflow:latest-gpu image).

Running test.py with a np.ndarray for the y target:

root@0a0414c2c84b:/# python test.py
2024-05-07 22:02:20.541375: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Epoch 1/50
2024-05-07 22:02:23.140812: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 400000000 exceeds 10% of free system memory.
2024-05-07 22:02:23.300856: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 400000000 exceeds 10% of free system memory.
2024-05-07 22:02:23.451243: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 400000000 exceeds 10% of free system memory.
2024-05-07 22:02:23.637732: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 400000000 exceeds 10% of free system memory.
2024-05-07 22:02:23.816659: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 400000000 exceeds 10% of free system memory.
10/10 ━━━━━━━━━━━━━━━━━━━━ 8s 663ms/step - loss: 0.1569
Epoch 2/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 7s 645ms/step - loss: 0.1397
Epoch 3/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 7s 657ms/step - loss: 0.1344
Epoch 4/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 7s 663ms/step - loss: 0.1313
Epoch 5/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 7s 658ms/step - loss: 0.1293
[...]

Running test.py with a pd.Series for the y target:

root@0a0414c2c84b:/# python test.py
2024-05-07 22:04:24.869910: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Epoch 1/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - loss: 0.1597  
Epoch 2/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - loss: 0.1419 
Epoch 3/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - loss: 0.1360 
Epoch 4/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 0.1326 
Epoch 5/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - loss: 0.1298
[...]

My Docker version is 24.0.5. I haven't tested with the latest version of Docker but could try it next week if necessary.

divyashreepathihalli added the keras-team-review-pending (Pending review by a Keras team member) label and removed the keras-team-review-pending label on May 9, 2024