Catboost generates very large artifacts and can have unstable learning #47

Open
pseudotensor opened this issue Oct 20, 2019 · 1 comment

@pseudotensor
Contributor

catboost/catboost#1023
catboost/catboost#1028

@pseudotensor pseudotensor self-assigned this Oct 20, 2019
@pseudotensor
Contributor Author

Hello!
I guess you have lots of categorical features in your dataset (possibly with high cardinality).
When we train models, we generate CTR tables for categorical features on the fly as they are needed, so it's totally normal that GPU memory usage shows practically no correlation with the resulting model size: after training, we calculate all selected CTR tables and save them in the model object.
To reduce model size, in 0.24 we finally implemented model size regularization: we now penalize model splits that use large CTR tables. model_size_reg is now turned on by default on both CPU and GPU and is set to 0.5. You can play with this parameter, raising it to get a smaller model.
You can also reduce model size by limiting CTR complexity with the max_ctr_complexity parameter; by default we greedily try to build combinations of up to 4 categorical features.
You can read about these parameters in the new blog post on Towards Data Science and in the tutorial covering categorical feature parameters.
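
For anyone who wants to try the two knobs mentioned above, here is a minimal sketch. The helper names `size_reduction_params` and `saved_model_size_bytes` are made up for illustration and are not part of CatBoost; only `model_size_reg` and `max_ctr_complexity` come from the comment. The iteration count and the idea of measuring the saved `.cbm` file are assumptions of this sketch, not a recommended setup.

```python
import os
import tempfile


def size_reduction_params(model_size_reg=0.5, max_ctr_complexity=4):
    """Build keyword arguments for the two size-related CatBoost parameters.

    Hypothetical helper. The defaults mirror the values quoted in the
    comment above (0.5 and combinations of up to 4 categorical features).
    """
    if model_size_reg < 0:
        raise ValueError("model_size_reg must be non-negative")
    if max_ctr_complexity < 1:
        raise ValueError("max_ctr_complexity must be at least 1")
    return {
        "model_size_reg": float(model_size_reg),
        "max_ctr_complexity": int(max_ctr_complexity),
    }


def saved_model_size_bytes(X, y, cat_features, **params):
    """Train a small classifier and return the size of the saved model file.

    Hypothetical helper. catboost is imported lazily so that the parameter
    helper above can be used without the package installed.
    """
    from catboost import CatBoostClassifier

    # iterations=50 is an arbitrary small value chosen for this sketch.
    model = CatBoostClassifier(iterations=50, verbose=False, **params)
    model.fit(X, y, cat_features=cat_features)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.cbm")
        model.save_model(path)
        return os.path.getsize(path)
```

Calling `saved_model_size_bytes(X, y, cat_features, **size_reduction_params(2.0, 2))` and comparing the result with a run using the defaults should show whether a higher `model_size_reg` or a lower `max_ctr_complexity` actually shrinks the artifact on your data; how much it helps will depend on the cardinality of your categorical features.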
