
Need Help with a Softmax Warning in TensorFlow 2.16 #67758

Open
sp00N221 opened this issue May 16, 2024 · 3 comments
Labels
subtype:windows Windows Build/Installation Issues · TF 2.16 · type:bug Bug

Comments

@sp00N221

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

source

TensorFlow version

2.16.1

Custom code

Yes

OS platform and distribution

Windows 11

Mobile device

No response

Python version

3.12.3

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

Hey everyone,
I'm running into a bit of a headache with TensorFlow 2.16 and could really use some help. I'm getting this annoying warning about a Softmax operation over an axis with size 1. This pops up when I'm using a custom TransformerBlock layer that includes MultiHeadAttention.

What I've Tried:
- Debugging dimensions: added print statements to check tensor shapes at different stages.
- Used tf.squeeze to remove dimensions of size 1 before passing the tensor to MultiHeadAttention.

What I Need:
- Is this a bug in TensorFlow 2.16? If so, are there any workarounds or patches?
- What are the best practices for handling tensor dimensions in MultiHeadAttention to avoid this?
- Should I downgrade or wait for an update? If so, which version should I try?

Additional Info:
- Using LSTM and GRU layers followed by the custom TransformerBlock.
- Running on Windows with Python 3.12.

Any help or pointers would be greatly appreciated! Thanks!

Standalone code to reproduce the issue

import tensorflow as tf
from tensorflow.keras.layers import MultiHeadAttention, Dense, LayerNormalization, Dropout

class TransformerBlock(tf.keras.layers.Layer):
    def __init__(self, t_num_heads, t_key_dim, t_ff_dim, dropout_rate=0.1, activation_function='relu',
                 initializer='glorot_uniform', **kwargs):
        super(TransformerBlock, self).__init__(**kwargs)
        self.att = MultiHeadAttention(num_heads=t_num_heads, key_dim=t_key_dim)
        self.ffn = tf.keras.Sequential([
            Dense(t_ff_dim, activation=activation_function, kernel_initializer=initializer),
            Dense(t_key_dim, kernel_initializer=initializer),
        ])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(dropout_rate)
        self.dropout2 = Dropout(dropout_rate)
        self.dense_proj = Dense(t_key_dim, kernel_initializer=initializer)

    def call(self, inputs, training=None, *args, **kwargs):
        inputs_proj = self.dense_proj(inputs)
        print(f"inputs_proj shape: {inputs_proj.shape}")

        if len(inputs_proj.shape) == 4 and inputs_proj.shape[2] == 1:
            inputs_proj = tf.squeeze(inputs_proj, axis=2)
            print(f"inputs_proj after squeeze shape: {inputs_proj.shape}")

        attn_output = self.att(inputs_proj, inputs_proj)
        print(f"attn_output shape: {attn_output.shape}")
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs_proj + attn_output)
        ffn_output = self.ffn(out1)
        print(f"ffn_output shape: {ffn_output.shape}")
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

    def compute_output_shape(self, input_shape):
        return input_shape

    def get_config(self):
        config = super(TransformerBlock, self).get_config()
        config.update({
            't_num_heads': self.att.num_heads,
            't_key_dim': self.att.key_dim,
            't_ff_dim': self.ffn.layers[0].units,
            'dropout_rate': self.dropout1.rate,
            'activation_function': self.ffn.layers[0].activation.__name__,
            'initializer': self.ffn.layers[0].kernel_initializer.__class__.__name__
        })
        return config

    @classmethod
    def from_config(cls, config):
        return cls(**config)

Relevant log output

UserWarning: You are using a softmax over axis 3 of a tensor of shape (None, 4, 1, 1). This axis has size 1. The softmax operation will always return the value 1, which is likely not what you intended. Did you mean to use a sigmoid instead?
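For context on the shape in the warning: MultiHeadAttention computes attention scores of shape (batch, num_heads, query_length, key_length) and applies the softmax over the key axis, so (None, 4, 1, 1) suggests the tensor reaching self.att has an effective sequence length of 1. Squeezing axes cannot fix that, because only one position is left to attend over. Below is a minimal sketch, assuming a recurrent front end (layer sizes and input shapes are illustrative, not taken from the report above), of keeping a real sequence axis so the attention softmax spans more than one position:

import tensorflow as tf

# Illustrative input: 32 timesteps with 8 features each.
inputs = tf.keras.Input(shape=(32, 8))

# return_sequences=True keeps the output shaped (batch, 32, units).
# With return_sequences=False the LSTM emits (batch, units), and expanding
# it back to 3-D creates the size-1 sequence axis that triggers the warning.
x = tf.keras.layers.LSTM(64, return_sequences=True)(inputs)

# Self-attention over the 32 timesteps: the attention scores have shape
# (batch, num_heads, 32, 32), so no softmax axis of size 1.
attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=64)(x, x)

model = tf.keras.Model(inputs, attn)
model.summary()

If the sequence axis entering the attention layer has length 1 (or is re-created via expand_dims), the same warning is expected, since the key axis of the scores collapses to 1.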
@CasualMathEnjoyer

CasualMathEnjoyer commented May 18, 2024

I am experiencing the same issue when implementing my own transformer encoder-decoder. So far I am still missing positional encoding and some masking layers; I don't know whether those would affect it in any way.

Here is my code, which shows the same warning with tf-nightly and Python 3.11.9:

import keras
import numpy as np

def possitionalEmbedding(input_dim, output_dim):  # TODO
    return keras.layers.Embedding(input_dim=input_dim, output_dim=output_dim)

def model_func(encoder_vocab_len, decoder_vocab_len, encoder_maxlen, decoder_maxlen, params):
    num_heads, key_dim, d_v, d_ff, d_model, n = params

    encoder_input = keras.Input(shape=(None,))
    decoder_input = keras.Input(shape=(None,))

    # encoder part
    embedded = possitionalEmbedding(encoder_vocab_len, d_model)(encoder_input) # todo possitional embedding
    embedded = keras.layers.Dropout(0.1)(embedded)

    encoded = embedded
    for i in range(n):
        attended_encoded = keras.layers.MultiHeadAttention(num_heads,
                                        key_dim,
                                        dropout=0.1,
                                        use_bias=True,
                                        output_shape=(d_model,))(encoded, encoded, encoded)  # todo padding_mask
        attended_encoded_d = keras.layers.Dropout(0.1)(attended_encoded)
        add = encoded + attended_encoded_d
        normalised = keras.layers.LayerNormalization()(add)
        fed_f = keras.layers.Dense(d_ff)(normalised)  # feed forward 1 part
        fed_ff = keras.layers.Dense(d_model)(keras.activations.relu(fed_f))  # feed forward 2 part
        fed_ff_d = keras.layers.Dropout(0.1)(fed_ff)

        add2 = normalised + fed_ff_d
        normalised2 = keras.layers.LayerNormalization()(add2)

        encoded = normalised2  # and the loop is repeated

    encoder_output = encoded  # output from encoder

    # decoder part
    de_embed = possitionalEmbedding(decoder_vocab_len, d_model)(decoder_input)
    de_embed = keras.layers.Dropout(0.1)(de_embed)

    for i in range(n):
        self_attention = (keras.layers.MultiHeadAttention(num_heads,
                                        key_dim,
                                        dropout=0.1,
                                        use_bias=True,
                                        output_shape=(d_model,))
                            (de_embed, de_embed, de_embed))
        self_attention_d = keras.layers.Dropout(0.1)(self_attention)
        add = de_embed + self_attention_d
        normalised1 = keras.layers.LayerNormalization()(add)
        cross_attention = (keras.layers.MultiHeadAttention(num_heads,
                                        key_dim,
                                        dropout=0.1,
                                        use_bias=True,
                                        output_shape=(d_model,))
                           (normalised1, encoder_output,encoder_output))
        cross_attention_d = keras.layers.Dropout(0.1)(cross_attention)

        add2 = normalised1 + cross_attention_d
        normalised2 = keras.layers.LayerNormalization()(add2)

        fed_f = keras.layers.Dense(d_ff)(normalised2)  # feed forward 1 part
        fed_ff = keras.layers.Dense(d_model)(keras.activations.relu(fed_f))  # feed forward 2 part
        fed_ff_d = keras.layers.Dropout(0.1)(fed_ff)

        add3 = normalised2 + fed_ff_d
        normalised3 = keras.layers.LayerNormalization()(add3)

        de_embed = normalised3

    decoder_dense_output = keras.layers.Dense(decoder_vocab_len, activation='softmax', name='decoder_output')(de_embed)

    return keras.Model(inputs=[encoder_input, decoder_input], outputs=decoder_dense_output)

if __name__ == '__main__':
    params = (8, 64, 64, 256, 512, 6)
    model = model_func(10000, 10000, 100, 100, params)
    model.summary()

    # Generate random input data with appropriate shapes
    encoder_input_data = np.random.randint(0, 10000, (2, 1))  # (batch_size, sequence_length)
    decoder_input_data = np.random.randint(0, 10000, (2, 4))  # (batch_size, sequence_length)

    # Call the model with the random input data
    output = model.call([encoder_input_data, decoder_input_data], training=False)

    # Print the shape of the output
    print(f'Output shape: {output.shape}')

UserWarning: You are using a softmax over axis 3 of a tensor of shape (2, 8, 1, 1). This axis has size 1. The softmax operation will always return the value 1, which is likely not what you intended. Did you mean to use a sigmoid instead?
warnings.warn(
UserWarning: You are using a softmax over axis 3 of a tensor of shape (2, 8, 4, 1). This axis has size 1. The softmax operation will always return the value 1, which is likely not what you intended. Did you mean to use a sigmoid instead?
warnings.warn(

Output shape: (2, 4, 10000)
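A likely explanation for both warnings (my reading, not confirmed by the maintainers): encoder_input_data has shape (2, 1), i.e. an encoder sequence length of 1. Since attention scores are shaped (batch, num_heads, query_length, key_length), encoder self-attention yields scores of shape (2, 8, 1, 1) and the decoder's cross-attention over the length-1 encoder output yields (2, 8, 4, 1); in both cases the softmax key axis has size 1, which is exactly what the warning flags. A quick check, reusing the model built above (the sequence length 7 is arbitrary):

import numpy as np

# A longer encoder sequence, e.g. length 7 instead of 1: the key axis of the
# attention scores becomes 7, so the size-1 softmax warning should no longer appear.
encoder_input_data = np.random.randint(0, 10000, (2, 7))
decoder_input_data = np.random.randint(0, 10000, (2, 4))

output = model([encoder_input_data, decoder_input_data], training=False)
print(output.shape)  # expected: (2, 4, 10000)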

@Venkat6871 Venkat6871 added TF 2.16 subtype:windows Windows Build/Installation Issues labels May 20, 2024
@Venkat6871

Venkat6871 commented May 21, 2024

Hi @sp00N221,

Sorry for the delay. Could you please check with a recent compatible TF version? I tried with TF 2.16.1 and cannot reproduce the error. Please see the attached screenshot (test1) here. Thanks!

@Venkat6871 Venkat6871 added the stat:awaiting response Status - Awaiting response from author label May 21, 2024
@sp00N221
Author

Hey,

Thank you for taking the time to review my issue. I've had nothing but problems with this task over the past few days. I had a combination of a TransformerBlock and LSTM layers; coupled with Optuna, it was probably just too many variables and possibilities, causing the model to become unstable. I have now switched to this approach:

import optuna
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

def objective(trial, features, target):
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    max_depth = trial.suggest_int('max_depth', 3, 15)
    learning_rate = trial.suggest_float('learning_rate', 0.01, 0.3, log=True)
    subsample = trial.suggest_float('subsample', 0.5, 1.0)
    colsample_bytree = trial.suggest_float('colsample_bytree', 0.5, 1.0)
    gamma = trial.suggest_float('gamma', 0, 5)
    min_child_weight = trial.suggest_int('min_child_weight', 1, 10)
    reg_lambda = trial.suggest_float('lambda', 1e-8, 10.0, log=True)
    reg_alpha = trial.suggest_float('alpha', 1e-8, 10.0, log=True)

    model = XGBClassifier(
        n_estimators=n_estimators, max_depth=max_depth, learning_rate=learning_rate,
        subsample=subsample, colsample_bytree=colsample_bytree, gamma=gamma,
        min_child_weight=min_child_weight, reg_lambda=reg_lambda, reg_alpha=reg_alpha,
        random_state=42
    )

    x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

    numeric_features = x_train.columns
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())]),
             numeric_features)
        ])

    x_train = preprocessor.fit_transform(x_train)
    x_test = preprocessor.transform(x_test)

    model.set_params(early_stopping_rounds=10, eval_metric='logloss')
    model.fit(x_train, y_train, eval_set=[(x_test, y_test)], verbose=False)

    predictions = model.predict(x_test)
    accuracy = accuracy_score(y_test, predictions)

    return accuracy
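For completeness, here is a sketch of how such an objective is typically driven by Optuna; features and target are placeholders for your own data, and n_trials is arbitrary:

import optuna

# features: a DataFrame of numeric columns; target: the corresponding labels.
study = optuna.create_study(direction='maximize')  # maximise accuracy
study.optimize(lambda trial: objective(trial, features, target), n_trials=50)

print('Best accuracy:', study.best_value)
print('Best parameters:', study.best_params)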

With this, I have no problems.
Have a nice day!

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label May 21, 2024