Today, unitxt uses a default seed (42) for all datasets, and there is currently no way to change it.
Changing the seed could affect the dataset significantly, given the random choices made along the pipeline, so it should be controlled.
I initially thought it should be a parameter of the standard recipe, so that it is explicit (and also to ensure that HF caching works correctly).
However, if we load multiple recipes and each sets its own seed, the seeds could collide.
I think it would be nice if we could set params on the recipe (meaning, passed between operators along the pipeline) that the individual operators could then access. You could then set a sub_seed there, and operators could use it to create their random generators based on the per-pipeline sub_seed and the global seed. Maybe that could be done by adding a dict of params to StreamingOperator?
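To make the idea concrete, here is a minimal sketch of how an operator might derive its generator from the global seed plus a per-pipeline sub_seed. This is not existing unitxt API: `make_generator`, the `params` dict, and the `"sub_seed"` key are all hypothetical names used for illustration only.

```python
import hashlib
import random

GLOBAL_SEED = 42  # the current default seed mentioned above


def make_generator(global_seed: int, sub_seed: str) -> random.Random:
    """Hypothetical helper: derive a generator from the global seed and a sub_seed.

    Hashing the pair gives each pipeline its own reproducible random stream,
    so two recipes loaded together do not share or overwrite one seed.
    """
    key = f"{global_seed}:{sub_seed}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer seed.
    return random.Random(int.from_bytes(digest[:8], "big"))


# Hypothetical per-pipeline params dict, passed along between operators.
params_a = {"sub_seed": "recipe_a"}
params_b = {"sub_seed": "recipe_b"}

rng_a = make_generator(GLOBAL_SEED, params_a["sub_seed"])
rng_b = make_generator(GLOBAL_SEED, params_b["sub_seed"])

sample_a = [rng_a.randint(0, 10**9) for _ in range(3)]
sample_b = [rng_b.randint(0, 10**9) for _ in range(3)]
```

With something like this, changing the global seed changes every pipeline's stream, while each pipeline stays reproducible and independent of which other recipes happen to be loaded.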
Without that, a potential solution is setting the sub_seed on the instances, like we do with other pipeline-level params (e.g. the name of the metrics). That's IMO not a very nice long-term solution, but maybe it's OK, not sure.
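A rough sketch of the instance-level alternative, assuming each instance is a plain dict that carries a `"sub_seed"` field. The field name, the `"choices"` field, and the `shuffle_choices` operator are hypothetical and only illustrate the shape of the approach, not how unitxt operators are actually written.

```python
import hashlib
import random
from typing import Any, Dict, Iterable, Iterator


def shuffle_choices(
    instances: Iterable[Dict[str, Any]], global_seed: int = 42
) -> Iterator[Dict[str, Any]]:
    """Hypothetical operator: shuffle each instance's choices reproducibly.

    The generator is derived from the global seed, the sub_seed stored on the
    instance, and the instance's position in the stream.
    """
    for index, instance in enumerate(instances):
        sub_seed = instance.get("sub_seed", "")
        key = f"{global_seed}:{sub_seed}:{index}".encode("utf-8")
        seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        rng = random.Random(seed)
        choices = list(instance["choices"])  # copy, so the input is not mutated
        rng.shuffle(choices)
        yield {**instance, "choices": choices}


stream = [
    {"sub_seed": "recipe_a", "choices": ["w", "x", "y", "z"]},
    {"sub_seed": "recipe_a", "choices": ["1", "2", "3", "4"]},
]
first_run = list(shuffle_choices(stream))
second_run = list(shuffle_choices(stream))
```

The downside is visible here: every instance has to carry the same sub_seed around, which is the part that does not feel like a good long-term design.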
@elronbandel @matanor - what do you think? I don't see a good solution.