
Incremental retrieval model training with Hashing method #703

Open
nicewenhui opened this issue Nov 13, 2023 · 2 comments

Comments

@nicewenhui

I have developed a retrieval model for personalized movie recommendations. However, in the real world, new users and new content keep emerging. To address this challenge, I have read about the benefits of using hashing-based embeddings.

In the tutorial, I found that the hashing layer is placed inside the model architecture. Why does this avoid retraining the model every time?
Besides, I don't know how to handle hashing collisions or how to choose an appropriate value for the num_bins parameter. In the provided example, even with only 5 inputs and num_bins set to 6, two values (['b'] and ['c']) were still hashed to the same bin.

import tensorflow as tf

layer = tf.keras.layers.Hashing(num_bins=6)
inp = [['a'], ['b'], ['c'], ['d'], ['e']]
layer(inp)

<tf.Tensor: shape=(5, 1), dtype=int64, numpy=
array([[3],
       [4],
       [4],
       [5],
       [1]])>

In my real code, for example, I previously had 10,000 user_ids, and each day around 1,000 new users arrive. How should I set num_bins to ensure each user gets a unique hashed code?
What about counting the total number of users each day and setting num_bins to that day's user count? Would old users still get the same hashed codes as before?
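To make the stability concern concrete, here is a plain-Python stand-in for the hashing layer, with `bucket` as a made-up helper. It uses hashlib.md5 purely as an analogy (TensorFlow's Hashing layer uses FarmHash64 by default, so the actual bin numbers will differ), but the modulo behavior it demonstrates is the same:

```python
# Plain-Python stand-in for a hashing layer: bin = hash(id) % num_bins.
# hashlib.md5 stands in for TensorFlow's FarmHash64 for illustration only.
import hashlib

def bucket(user_id: str, num_bins: int) -> int:
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_bins

users = ["user_1", "user_2", "user_3"]

# Same num_bins -> every existing user keeps the same bin across days:
assert [bucket(u, 10_000) for u in users] == [bucket(u, 10_000) for u in users]

# Growing num_bins (e.g. to today's user count) changes the modulo,
# so existing users generally land in different bins than before.
before = [bucket(u, 10_000) for u in users]
after = [bucket(u, 11_000) for u in users]
print(before, after)
```

So keeping num_bins fixed (and comfortably larger than the expected number of ids, to keep collisions rare) is what preserves old users' codes; resizing it each day would invalidate them.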

Thanks in advance.

@OmarMAmin

I guess hashing has two main benefits. First, new items get mapped to different hash buckets, so not all new items are treated the same. Second, collisions can have a regularization effect, as long as there aren't too many of them: data for rare items is usually sparse, so dedicating a single embedding to each of them may overfit.

@OmarMAmin

For the user_ids, you can represent a user by the item_ids the user is consuming, to avoid retraining the model for each new user_id. If the catalog is more stable, you'll get a more stable model by representing the user through selected features of their previous behavior.
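A minimal sketch of that idea, assuming a toy 2-dimensional item-embedding table (`item_emb` and `user_vector` are made-up names, not TFRS API): the user's vector is the mean of the embeddings of the items they consumed, so a brand-new user gets a representation without any retraining.

```python
# Toy item embeddings, assumed already trained by the retrieval model.
item_emb = {
    "movie_a": [1.0, 0.0],
    "movie_b": [0.5, 0.5],
    "movie_c": [0.0, 1.0],
}

def user_vector(watched: list[str]) -> list[float]:
    """Mean-pool the embeddings of the items a user has consumed."""
    vecs = [item_emb[i] for i in watched if i in item_emb]
    if not vecs:                      # cold-start user with no known items
        return [0.0, 0.0]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

# A new user who watched movie_a and movie_b needs no retraining:
print(user_vector(["movie_a", "movie_b"]))  # [0.75, 0.25]
```

As the comment notes, this is only as stable as the item catalog; if item ids also churn quickly, they can go through the same Hashing trick.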
