You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I have developed a retrieval model for personalized movie recommendations. However, in the real world, new users and new content continue to emerge. To address this challenge, I have learned about the benefits of using hashing embedding.
In tutorial, I found the hashing layer was putted as part of the model architecture. why this can avoid retraining the model every time?
Besides, I don't know how to handle hashing collisions and determine the appropriate value for the num_bins parameter. In the provided example, even with only 5 inputs and setting num_bins to 6, 2 values (['b'],['c'] ) were still hashed to the same bin.
In my real codes, for example, I have 10000 user_id before, and each day will have around 1000 new users, how should I set the num_bins to ensure each user has their unique hashed code?
How about calculating the total number of users each day and setting the num_bins parameter to the number of users for that specific day? Will the old users still have the same hashed codes as before?
Thanks in advance.
The text was updated successfully, but these errors were encountered:
I guess hashing have two main benefits, one that new items gets mapped to different hashing buckets (so not all new items will be treated the same), and having a collision can have a regularization effect, specifically if it's not too much collisions, and the fact that usually we have sparse data for rare items so dedicating a single embedding for them may overfit.
For the user_ids, You can represent the user id by the item_ids the user is consuming, to avoid retraining the model for each new user_id, if the cataloge is more stable, you'll have a more stable model representing the user id by selected features out of his previous behavior
I have developed a retrieval model for personalized movie recommendations. However, in the real world, new users and new content continue to emerge. To address this challenge, I have learned about the benefits of using hashing embedding.
In tutorial, I found the hashing layer was putted as part of the model architecture. why this can avoid retraining the model every time?
Besides, I don't know how to handle hashing collisions and determine the appropriate value for the num_bins parameter. In the provided example, even with only 5 inputs and setting num_bins to 6, 2 values (['b'],['c'] ) were still hashed to the same bin.
In my real codes, for example, I have 10000 user_id before, and each day will have around 1000 new users, how should I set the num_bins to ensure each user has their unique hashed code?
How about calculating the total number of users each day and setting the num_bins parameter to the number of users for that specific day? Will the old users still have the same hashed codes as before?
Thanks in advance.
The text was updated successfully, but these errors were encountered: