
[BUG] CausalLanguageModeling does not mask last input item #765

Open
sungho-ham opened this issue Jan 5, 2024 · 3 comments
Labels
bug Something isn't working status/needs-triage

Comments

@sungho-ham

Bug description

The CLM masking for the last-item-only case does not mask the last item in the input. As a result, the embedding of the label is used instead of the mask embedding.

I think the following code needs to be fixed:

mask_labels = item_ids != self.padding_idx

Steps/Code to reproduce bug

import torch
from transformers4rec.torch import masking

item_ids = torch.tensor([[1, 2, 0], ])
mask = masking.CausalLanguageModeling(hidden_size=10, train_on_last_item_seq_only=True)
masking_info = mask.compute_masked_targets(item_ids, training=True)
print(masking_info)
# Output: MaskingInfo(schema=tensor([[ True,  True, False]]), targets=tensor([[2, 0, 0]]))

Expected behavior

MaskingInfo(schema=tensor([[ True,  False, False]]), targets=tensor([[2, 0, 0]]))

Environment details

  • Transformers4Rec version: 23.08.00

Additional context
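
For reference, here is a minimal hand-derivation of the expected schema and targets (illustrative only, not the library code; it assumes padding_idx=0 and the next-item shift used for CLM targets):

import torch

item_ids = torch.tensor([[1, 2, 0]])

# targets are the next-item ids; with train_on_last_item_seq_only only the
# last non-padded target should be kept, and the schema should follow from it
targets = torch.cat([item_ids[:, 1:], torch.zeros_like(item_ids[:, :1])], dim=1)
target_mask = targets != 0                     # tensor([[ True, False, False]])

rows = torch.arange(item_ids.size(0))
last_pos = target_mask.sum(dim=1) - 1          # position of the last valid target
last_only = torch.zeros_like(targets)
last_only[rows, last_pos] = targets[rows, last_pos]

schema = last_only != 0
print(schema)     # tensor([[ True, False, False]])
print(last_only)  # tensor([[2, 0, 0]])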

@sungho-ham added the bug and status/needs-triage labels on Jan 5, 2024
@dcy0577

dcy0577 commented Jan 24, 2024

I think this line of code needs to be removed:

mask_labels = item_ids != self.padding_idx

As a solution, just use the mask_labels from predict_all() above.
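
For the example from the issue, the difference between the two schemas looks like this (a quick sketch assuming padding_idx=0; the shifted labels below stand in for what predict_all() returns):

import torch

item_ids = torch.tensor([[1, 2, 0]])
# next-item labels, roughly what predict_all() produces: [[2, 0, 0]]
shifted_labels = torch.cat([item_ids[:, 1:], torch.zeros_like(item_ids[:, :1])], dim=1)

current_schema = item_ids != 0            # tensor([[ True,  True, False]])  exposes the label
predict_all_schema = shifted_labels != 0  # tensor([[ True, False, False]])
print(current_schema)
print(predict_all_schema)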

And I think the reason why the current code somehow works is this part:

# shift sequence of interaction embeddings
pos_emb_inp = inputs[:, :-1]
# Adding a masked item in the sequence to return to the initial sequence.
pos_emb_inp = torch.cat(  # type: ignore
    [
        pos_emb_inp,
        torch.zeros(
            (pos_emb_inp.shape[0], 1, pos_emb_inp.shape[2]),
            dtype=pos_emb_inp.dtype,
        ).to(inputs.device),
    ],
    axis=1,
)
# Replacing the inputs corresponding to padded items with a trainable embedding
pos_emb_inp = torch.where(
    mask_schema.unsqueeze(-1).bool(),
    pos_emb_inp,
    self.masked_item_embedding.to(pos_emb_inp.dtype),
)
return pos_emb_inp

Given an input sequence without padding, [1, 2, 3], the mask schema generated by the current code during evaluation will be [True, True, True], which exposes the last item. However, apply_mask_to_inputs replaces the last item with a 0 embedding, and since the schema is all True, no mask embedding is applied to the input. In this case the 0 embedding sort of plays the role of a mask.
However, when the input has padding, like [1, 2, 3, 0, 0], the current mask schema will be [True, True, True, False, False]. Because the last position is a padding item, apply_mask_to_inputs merely replaces that padding with a 0 embedding. The mask schema then masks only the last two padding items, keeping items 1, 2, 3 visible to the transformer.
I think that's why people encounter issues when testing CLM: if there is always padding in the input data, the evaluation metrics will be unrealistically high.
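
Here is a toy reconstruction of the two cases (not the library code): each "embedding" is a single scalar equal to its item id, and -1 stands in for the trainable mask embedding, so it is easy to see what the transformer ends up seeing.

import torch

def toy_apply_mask_to_inputs(inputs, mask_schema, masked_item_embedding):
    # Same steps as the snippet quoted above, stripped down for illustration.
    pos_emb_inp = inputs[:, :-1]
    pos_emb_inp = torch.cat(
        [pos_emb_inp, torch.zeros((pos_emb_inp.shape[0], 1, pos_emb_inp.shape[2]))],
        dim=1,
    )
    return torch.where(
        mask_schema.unsqueeze(-1).bool(), pos_emb_inp, masked_item_embedding
    )

mask_emb = torch.tensor(-1.0)  # stand-in for the trainable mask embedding

# No padding: schema [True, True, True]; the appended zero column hides item 3.
no_pad = torch.tensor([[[1.0], [2.0], [3.0]]])
print(toy_apply_mask_to_inputs(no_pad, torch.tensor([[True, True, True]]), mask_emb))
# -> [[1.], [2.], [0.]]  the label (item 3) is hidden by the zeros

# With padding: schema [True, True, True, False, False]; the zero column lands on
# a padding slot, so item 3 stays visible while it is also the evaluation target.
padded = torch.tensor([[[1.0], [2.0], [3.0], [0.0], [0.0]]])
print(toy_apply_mask_to_inputs(
    padded, torch.tensor([[True, True, True, False, False]]), mask_emb))
# -> [[1.], [2.], [3.], [-1.], [-1.]]  the label leaks into the input

In the padded case the schema never hides item 3, which matches the inflated evaluation metrics described above.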

@jian-mo

jian-mo commented Jan 30, 2024

I also noticed this bug. After the fix, recall is about 20% lower.

@zhouyu5

zhouyu5 commented Apr 2, 2024

Any further updates? It seems #723 still does not solve this bug.
