Seed control in unitxt #549

Open
yoavkatz opened this issue Feb 1, 2024 · 1 comment

yoavkatz commented Feb 1, 2024

Today, unitxt uses a default seed (42) for all datasets. It's not actually possible to change the seed today.
Changing the seed could affect the dataset significantly given random choices, so it should be controlled.

I initially thought it should be a parameter of the standard recipe so that it is explicit (and also to ensure that HF caching works correctly).

However, if we load multiple recipes and each sets its own seed, the seeds could collide.
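
To make the collision concrete, here is a minimal sketch. It assumes a recipe seeds the process-wide `random` module when it is loaded and only draws values later, when its stream is consumed. `lazy_recipe` is a hypothetical stand-in, not unitxt's actual API.

```python
import random


def lazy_recipe(seed, n):
    """Hypothetical recipe: seeds the global RNG on load, samples lazily."""
    random.seed(seed)  # global state shared by every recipe in the process

    def stream():
        for _ in range(n):
            yield random.randint(0, 999)

    return stream()


# Recipe A loaded and consumed on its own.
a_alone = list(lazy_recipe(seed=1, n=5))

# Recipe A loaded, then recipe B loaded before A's stream is consumed.
a = lazy_recipe(seed=1, n=5)
b = lazy_recipe(seed=2, n=5)  # overwrites the global seed that A set
a_after_b = list(a)

# A's values are now drawn from B's seed instead of its own.
b_alone = list(lazy_recipe(seed=2, n=5))
print(a_alone == a_after_b)
print(a_after_b == b_alone)  # True
```

With a single global seed, the result of one recipe depends on which other recipes were loaded and in what order.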

@elronbandel @matanor - what do you think? I don't see a good solution.

matanor commented Feb 4, 2024

I think it would be nice if we could set params on the recipe (meaning, passed between operators along the pipeline) that the individual operators could then access. You could then set a sub_seed there, and operators could use it to create their random generators based on the per-pipeline sub_seed and the global seed. Maybe that could be done by adding a dict of params to StreamingOperator?
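
A rough sketch of what that could look like, assuming each operator receives a `params` dict and builds a private generator from it. `make_rng`, `ShuffleOperator`, and the `params` keys are hypothetical names for illustration, not existing unitxt code.

```python
import hashlib
import random


def make_rng(global_seed, sub_seed, operator_name):
    """Derive a private generator from the global seed, sub_seed and operator name."""
    key = f"{global_seed}/{sub_seed}/{operator_name}".encode()
    digest = hashlib.sha256(key).digest()
    # random.Random keeps its own state, so the module-level RNG is untouched.
    return random.Random(int.from_bytes(digest[:8], "big"))


class ShuffleOperator:
    """Hypothetical operator that reads its seeds from pipeline-level params."""

    def __init__(self, name):
        self.name = name

    def process(self, items, params):
        rng = make_rng(params["global_seed"], params["sub_seed"], self.name)
        items = list(items)
        rng.shuffle(items)
        return items


params = {"global_seed": 42, "sub_seed": 7}  # one dict per pipeline
op = ShuffleOperator("shuffle")
first = op.process(range(10), params)
second = op.process(range(10), params)
print(first == second)  # True: same seeds and operator name give the same order
```

Because every generator is derived from explicit values rather than shared global state, loading a second recipe with a different sub_seed would not change the output of the first.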

Without that, a potential solution is setting the sub_seed on the instances, like we do with other pipeline-level params (e.g. the name of the metrics). That's IMO not a very nice long-term solution, but maybe it's ok, not sure.

What do you think?
