[BUG] model.evaluate() and topk_model.evaluate() do not give same metric (case of no sampled softmax) #1148

Open
murali-munna opened this issue Jun 13, 2023 · 0 comments
Labels: bug (Something isn't working), P1
Milestone: Merlin 23.07

murali-munna commented Jun 13, 2023

Bug description

I am running experiments with sampled softmax: I use the new method (topk_encoder) to evaluate when sampled softmax is enabled. However, when sampled softmax is not used, I expected both of the methods below to return the same metrics, but they do not.

Steps/Code to reproduce bug

Original way of evaluation:

predict_last = mm.SequenceMaskLast(schema=seq_schema, target=target, transformer=xlnet_block)
eval_results = model_transformer.evaluate(
    valid_ds,
    batch_size=512,
    pre=predict_last,
    return_dict=True,
)

New way of evaluation:

target = train_ds_schema.select_by_tag(Tags.ITEM_ID).first
max_k = 10
topk_model = model_transformer.to_top_k_encoder(k=max_k)
topk_model.compile(run_eagerly=False)

loader = mm.Loader(valid_ds, batch_size=512)
topk_eval_results = topk_model.evaluate(loader, return_dict=True, pre=predict_last)

Expected behavior

In the case of no sampled softmax, I expected both of the above snippets to produce the same results.
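
In other words, a check like the following was expected to pass (a rough sketch only, assuming the two result dicts are stored as eval_results and topk_eval_results as in the snippets above):

import numpy as np

# Compare every metric reported by both evaluation paths; without sampled
# softmax they should agree up to numerical noise.
for name in sorted(set(eval_results) & set(topk_eval_results)):
    assert np.isclose(eval_results[name], topk_eval_results[name], atol=1e-4), name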

Environment details

  • Merlin version: 23.05 TF
  • Tensorflow version (GPU?): 2.11.0+nv23.2

Additional context

REPRODUCIBLE EXAMPLE (from Ronay):

import os
import itertools

import numpy as np
import tensorflow as tf

import merlin.models.tf as mm
from merlin.dataloader.ops.embeddings import EmbeddingOperator
from merlin.io import Dataset
from merlin.schema import Tags
from merlin.datasets.synthetic import generate_data

sequence_testing_data = generate_data("sequence-testing", num_rows=100)
sequence_testing_data.schema = sequence_testing_data.schema.select_by_tag(
    Tags.SEQUENCE
).select_by_tag(Tags.CATEGORICAL)
seq_schema = sequence_testing_data.schema

item_id_name = seq_schema.select_by_tag(Tags.ITEM).first.properties['domain']['name']

target = sequence_testing_data.schema.select_by_tag(Tags.ITEM_ID).column_names[0]

query_schema = seq_schema
output_schema = seq_schema.select_by_name(target)

d_model = 48
BATCH_SIZE = 32

dmodel = int(os.environ.get("dmodel", '48'))

input_block = mm.InputBlockV2(
    query_schema,
    embeddings=mm.Embeddings(
        seq_schema.select_by_tag(Tags.CATEGORICAL),
        sequence_combiner=None,
        dim=dmodel,
    ),
)

xlnet_block = mm.XLNetBlock(d_model=dmodel, n_head=2, n_layer=2)

def get_output_block(schema, input_block=None):
    # Tie the output layer to the item-id embedding table from the input block.
    candidate_table = input_block["categorical"][item_id_name]
    to_call = candidate_table
    outputs = mm.CategoricalOutput(to_call=to_call)
    return outputs

output_block = get_output_block(seq_schema, input_block=input_block)

projection = mm.MLPBlock(
    [128, output_block.to_call.table.dim],
    no_activation_last_layer=True,
)
session_encoder = mm.Encoder(
    input_block,
    mm.MLPBlock([128, dmodel], no_activation_last_layer=True),
    xlnet_block,
    projection,
)
model = mm.RetrievalModelV2(query=session_encoder, output=output_block)
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.005,
)

loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
model.compile(
    run_eagerly=False, 
    optimizer=optimizer, 
    loss=loss,
    metrics=mm.TopKMetricsAggregator.default_metrics(top_ks=[10])
)
model.fit(
    sequence_testing_data, 
    batch_size=32, 
    epochs=1, 
    pre=mm.SequenceMaskRandom(schema=seq_schema, target=target, masking_prob=0.3, transformer=xlnet_block)
)

predict_last = mm.SequenceMaskLast(schema=seq_schema, target=target, transformer=xlnet_block)
model.evaluate(
    sequence_testing_data,
    batch_size=BATCH_SIZE,
    pre=predict_last,
    return_dict=True
)

Once this has run, please run the following and compare the metric values from model.evaluate() above with those from topk_model.evaluate() below. The results do not match (a small side-by-side comparison sketch follows the snippet).

loader = mm.Loader(sequence_testing_data, batch_size=BATCH_SIZE)
max_k = 10
topk_model = model.to_top_k_encoder(k=max_k)
topk_model.compile(run_eagerly=False)

metrics = topk_model.evaluate(loader, return_dict=True, pre=predict_last)
metrics
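
For reference, here is a minimal side-by-side comparison (a sketch only; it assumes model, topk_model, loader, predict_last, sequence_testing_data, and BATCH_SIZE are defined as above; variable names are illustrative):

baseline_metrics = model.evaluate(
    sequence_testing_data, batch_size=BATCH_SIZE, pre=predict_last, return_dict=True
)
topk_metrics = topk_model.evaluate(loader, return_dict=True, pre=predict_last)

# Print the metrics reported by both paths next to each other; top-k keys such
# as recall_at_10 are expected to appear in both dicts, yet the values differ.
for name in sorted(set(baseline_metrics) & set(topk_metrics)):
    print(f"{name}: model.evaluate={baseline_metrics[name]:.4f} "
          f"topk_model.evaluate={topk_metrics[name]:.4f}")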

Please note that the metric values change each time we rerun topk_model.evaluate(). I added shuffle=False to the loader, but I still get different metric values.
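
For reference, a rough sketch of that rerun check (the shuffle=False flag on mm.Loader is what I passed; everything else mirrors the code above):

loader = mm.Loader(sequence_testing_data, batch_size=BATCH_SIZE, shuffle=False)

first_run = topk_model.evaluate(loader, return_dict=True, pre=predict_last)
second_run = topk_model.evaluate(loader, return_dict=True, pre=predict_last)

# Even with shuffling disabled, the two runs report different metric values.
for name in sorted(first_run):
    print(f"{name}: run1={first_run[name]:.4f}  run2={second_run[name]:.4f}")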

murali-munna added the bug (Something isn't working) and status/needs-triage labels on Jun 13, 2023
rnyak added the P0 label on Jun 14, 2023
rnyak added this to the Merlin 23.07 milestone on Jun 14, 2023
rnyak changed the title from "[BUG] trainer.evaluate() and topk_model.evaluate() do not give same metric (case of no sampled softmax)" to "[BUG] model.evaluate() and topk_model.evaluate() do not give same metric (case of no sampled softmax)" on Jun 14, 2023