
Need Help with a Softmax Warning in TensorFlow 2.16 #67758

Open
sp00N221 opened this issue May 16, 2024 · 3 comments
Labels
subtype:windows Windows Build/Installation Issues · TF 2.16 · type:bug Bug

Comments

@sp00N221

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

source

TensorFlow version

2.16.1

Custom code

Yes

OS platform and distribution

Windows 11

Mobile device

No response

Python version

3.12.3

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

Hey everyone,
I'm running into a bit of a headache with TensorFlow 2.16 and could really use some help. I'm getting this annoying warning about a Softmax operation over an axis with size 1. This pops up when I'm using a custom TransformerBlock layer that includes MultiHeadAttention.

What I've Tried:
- Debugging dimensions: added print statements to check tensor shapes at different stages.
- Used tf.squeeze to remove dimensions of size 1 before passing the tensor to MultiHeadAttention.

What I Need:
- Is this a bug in TensorFlow 2.16? If so, are there any workarounds or patches?
- What are the best practices for handling tensor dimensions in MultiHeadAttention to avoid this?
- Should I downgrade or wait for an update? If so, which version should I try?

Additional Info:
- Using LSTM and GRU layers followed by the custom TransformerBlock.
- Running on Windows with Python 3.12.

Any help or pointers would be greatly appreciated! Thanks!

Standalone code to reproduce the issue

import tensorflow as tf
from tensorflow.keras.layers import MultiHeadAttention, Dense, LayerNormalization, Dropout

class TransformerBlock(tf.keras.layers.Layer):
    def __init__(self, t_num_heads, t_key_dim, t_ff_dim, dropout_rate=0.1, activation_function='relu',
                 initializer='glorot_uniform', **kwargs):
        super(TransformerBlock, self).__init__(**kwargs)
        self.att = MultiHeadAttention(num_heads=t_num_heads, key_dim=t_key_dim)
        self.ffn = tf.keras.Sequential([
            Dense(t_ff_dim, activation=activation_function, kernel_initializer=initializer),
            Dense(t_key_dim, kernel_initializer=initializer),
        ])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(dropout_rate)
        self.dropout2 = Dropout(dropout_rate)
        self.dense_proj = Dense(t_key_dim, kernel_initializer=initializer)

    def call(self, inputs, training=None, *args, **kwargs):
        inputs_proj = self.dense_proj(inputs)
        print(f"inputs_proj shape: {inputs_proj.shape}")

        if len(inputs_proj.shape) == 4 and inputs_proj.shape[2] == 1:
            inputs_proj = tf.squeeze(inputs_proj, axis=2)
            print(f"inputs_proj after squeeze shape: {inputs_proj.shape}")

        attn_output = self.att(inputs_proj, inputs_proj)
        print(f"attn_output shape: {attn_output.shape}")
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs_proj + attn_output)
        ffn_output = self.ffn(out1)
        print(f"ffn_output shape: {ffn_output.shape}")
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

    def compute_output_shape(self, input_shape):
        return input_shape

    def get_config(self):
        config = super(TransformerBlock, self).get_config()
        config.update({
            't_num_heads': self.att.num_heads,
            't_key_dim': self.att.key_dim,
            't_ff_dim': self.ffn.layers[0].units,
            'dropout_rate': self.dropout1.rate,
            'activation_function': self.ffn.layers[0].activation.__name__,
            'initializer': self.ffn.layers[0].kernel_initializer.__class__.__name__
        })
        return config

    @classmethod
    def from_config(cls, config):
        return cls(**config)

Relevant log output

UserWarning: You are using a softmax over axis 3 of a tensor of shape (None, 4, 1, 1). This axis has size 1. The softmax operation will always return the value 1, which is likely not what you intended. Did you mean to use a sigmoid instead?
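For context on the shape in the warning: MultiHeadAttention computes attention scores of shape (batch, num_heads, query_length, key_length) and applies the softmax over the key axis, so (None, 4, 1, 1) suggests the tensor reaching self.att has an effective sequence length of 1. Squeezing axes cannot fix that, because only one position is left to attend over. Below is a minimal sketch, assuming a recurrent front end (layer sizes and input shapes are illustrative, not taken from the report above), of keeping a real sequence axis so the attention softmax spans more than one position:

import tensorflow as tf

# Illustrative input: 32 timesteps with 8 features each.
inputs = tf.keras.Input(shape=(32, 8))

# return_sequences=True keeps the output shaped (batch, 32, units).
# With return_sequences=False the LSTM emits (batch, units), and expanding
# it back to 3-D creates the size-1 sequence axis that triggers the warning.
x = tf.keras.layers.LSTM(64, return_sequences=True)(inputs)

# Self-attention over the 32 timesteps: the attention scores have shape
# (batch, num_heads, 32, 32), so no softmax axis of size 1.
attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=64)(x, x)

model = tf.keras.Model(inputs, attn)
model.summary()

If the sequence axis entering the attention layer has length 1 (or is re-created via expand_dims), the same warning is expected, since the key axis of the scores collapses to 1.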
@CasualMathEnjoyer

CasualMathEnjoyer commented May 18, 2024

I am experiencing the same issue when implementing my own transformer encoder-decoder. So far I am still missing positional encoding and some masking layers; I don't know whether those would affect it in any way.

Here is my code, which shows the same warning with tf-nightly and Python 3.11.9:

import keras
import numpy as np

def possitionalEmbedding(input_dim, output_dim):  # TODO
    return keras.layers.Embedding(input_dim=input_dim, output_dim=output_dim)

def model_func(encoder_vocab_len, decoder_vocab_len, encoder_maxlen, decoder_maxlen, params):
    num_heads, key_dim, d_v, d_ff, d_model, n = params

    encoder_input = keras.Input(shape=(None,))
    decoder_input = keras.Input(shape=(None,))

    # encoder part
    embedded = possitionalEmbedding(encoder_vocab_len, d_model)(encoder_input) # todo possitional embedding
    embedded = keras.layers.Dropout(0.1)(embedded)

    encoded = embedded
    for i in range(n):
        attended_encoded = keras.layers.MultiHeadAttention(num_heads,
                                        key_dim,
                                        dropout=0.1,
                                        use_bias=True,
                                        output_shape=(d_model,))(encoded, encoded, encoded)  # todo padding_mask
        attended_encoded_d = keras.layers.Dropout(0.1)(attended_encoded)
        add = encoded + attended_encoded_d
        normalised = keras.layers.LayerNormalization()(add)
        fed_f = keras.layers.Dense(d_ff)(normalised)  # feed forward 1 part
        fed_ff = keras.layers.Dense(d_model)(keras.activations.relu(fed_f))  # feed forward 2 part
        fed_ff_d = keras.layers.Dropout(0.1)(fed_ff)

        add2 = normalised + fed_ff_d
        normalised2 = keras.layers.LayerNormalization()(add2)

        encoded = normalised2  # and the loop is repeated

    encoder_output = encoded  # output from encoder

    # decoder part
    de_embed = possitionalEmbedding(decoder_vocab_len, d_model)(decoder_input)
    de_embed = keras.layers.Dropout(0.1)(de_embed)

    for i in range(n):
        self_attention = (keras.layers.MultiHeadAttention(num_heads,
                                        key_dim,
                                        dropout=0.1,
                                        use_bias=True,
                                        output_shape=(d_model,))
                            (de_embed, de_embed, de_embed))
        self_attention_d = keras.layers.Dropout(0.1)(self_attention)
        add = de_embed + self_attention_d
        normalised1 = keras.layers.LayerNormalization()(add)
        cross_attention = (keras.layers.MultiHeadAttention(num_heads,
                                        key_dim,
                                        dropout=0.1,
                                        use_bias=True,
                                        output_shape=(d_model,))
                           (normalised1, encoder_output,encoder_output))
        cross_attention_d = keras.layers.Dropout(0.1)(cross_attention)

        add2 = normalised1 + cross_attention_d
        normalised2 = keras.layers.LayerNormalization()(add2)

        fed_f = keras.layers.Dense(d_ff)(normalised2)  # feed forward 1 part
        fed_ff = keras.layers.Dense(d_model)(keras.activations.relu(fed_f))  # feed forward 2 part
        fed_ff_d = keras.layers.Dropout(0.1)(fed_ff)

        add3 = normalised2 + fed_ff_d
        normalised3 = keras.layers.LayerNormalization()(add3)

        de_embed = normalised3

    decoder_dense_output = keras.layers.Dense(decoder_vocab_len, activation='softmax', name='decoder_output')(de_embed)

    return keras.Model(inputs=[encoder_input, decoder_input], outputs=decoder_dense_output)

if __name__ == '__main__':
    params = (8, 64, 64, 256, 512, 6)
    model = model_func(10000, 10000, 100, 100, params)
    model.summary()

    # Generate random input data with appropriate shapes
    encoder_input_data = np.random.randint(0, 10000, (2, 1))  # (batch_size, sequence_length)
    decoder_input_data = np.random.randint(0, 10000, (2, 4))  # (batch_size, sequence_length)

    # Call the model with the random input data
    output = model.call([encoder_input_data, decoder_input_data], training=False)

    # Print the shape of the output
    print(f'Output shape: {output.shape}')

UserWarning: You are using a softmax over axis 3 of a tensor of shape (2, 8, 1, 1). This axis has size 1. The softmax operation will always return the value 1, which is likely not what you intended. Did you mean to use a sigmoid instead?
warnings.warn(
UserWarning: You are using a softmax over axis 3 of a tensor of shape (2, 8, 4, 1). This axis has size 1. The softmax operation will always return the value 1, which is likely not what you intended. Did you mean to use a sigmoid instead?
warnings.warn(

Output shape: (2, 4, 10000)
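A likely explanation for both warnings (my reading, not confirmed by the maintainers): encoder_input_data has shape (2, 1), i.e. an encoder sequence length of 1. Since attention scores are shaped (batch, num_heads, query_length, key_length), encoder self-attention yields scores of shape (2, 8, 1, 1) and the decoder's cross-attention over the length-1 encoder output yields (2, 8, 4, 1); in both cases the softmax key axis has size 1, which is exactly what the warning flags. A quick check, reusing the model built above (the sequence length 7 is arbitrary):

import numpy as np

# A longer encoder sequence, e.g. length 7 instead of 1: the key axis of the
# attention scores becomes 7, so the size-1 softmax warning should no longer appear.
encoder_input_data = np.random.randint(0, 10000, (2, 7))
decoder_input_data = np.random.randint(0, 10000, (2, 4))

output = model([encoder_input_data, decoder_input_data], training=False)
print(output.shape)  # expected: (2, 4, 10000)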

@Venkat6871 Venkat6871 added TF 2.16 subtype:windows Windows Build/Installation Issues labels May 20, 2024
@Venkat6871

Venkat6871 commented May 21, 2024

Hi @sp00N221,

Sorry for the delay. Could you please check with a recent compatible TF version? I tried with TF 2.16.1 and cannot reproduce the error. Please see the attached screenshot (test1) here. Thanks!

@Venkat6871 Venkat6871 added the stat:awaiting response Status - Awaiting response from author label May 21, 2024
@sp00N221
Author

Hey,

Thank you for taking the time to review my issue. I've had nothing but problems with this task over the past few days. I had a combination of a TransformerBlock and LSTM layers; coupled with Optuna, it was probably just too many variables and possibilities, causing the model to become unstable. I have now switched to this approach:

import optuna
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

def objective(trial, features, target):
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    max_depth = trial.suggest_int('max_depth', 3, 15)
    learning_rate = trial.suggest_float('learning_rate', 0.01, 0.3, log=True)
    subsample = trial.suggest_float('subsample', 0.5, 1.0)
    colsample_bytree = trial.suggest_float('colsample_bytree', 0.5, 1.0)
    gamma = trial.suggest_float('gamma', 0, 5)
    min_child_weight = trial.suggest_int('min_child_weight', 1, 10)
    reg_lambda = trial.suggest_float('lambda', 1e-8, 10.0, log=True)
    reg_alpha = trial.suggest_float('alpha', 1e-8, 10.0, log=True)

    model = XGBClassifier(
        n_estimators=n_estimators, max_depth=max_depth, learning_rate=learning_rate,
        subsample=subsample, colsample_bytree=colsample_bytree, gamma=gamma,
        min_child_weight=min_child_weight, reg_lambda=reg_lambda, reg_alpha=reg_alpha,
        random_state=42
    )

    x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

    numeric_features = x_train.columns
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())]),
             numeric_features)
        ])

    x_train = preprocessor.fit_transform(x_train)
    x_test = preprocessor.transform(x_test)

    model.set_params(early_stopping_rounds=10, eval_metric='logloss')
    model.fit(x_train, y_train, eval_set=[(x_test, y_test)], verbose=False)

    predictions = model.predict(x_test)
    accuracy = accuracy_score(y_test, predictions)

    return accuracy
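For completeness, here is a sketch of how such an objective is typically driven by Optuna; features and target are placeholders for your own data, and n_trials is arbitrary:

import optuna

# features: a DataFrame of numeric columns; target: the corresponding labels.
study = optuna.create_study(direction='maximize')  # maximise accuracy
study.optimize(lambda trial: objective(trial, features, target), n_trials=50)

print('Best accuracy:', study.best_value)
print('Best parameters:', study.best_params)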

With this, I have no problems.
Have a nice day!

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label May 21, 2024