Random forest runs slower on sparse input #108

Open
piotrszul opened this issue Apr 8, 2019 · 1 comment

Comments

@piotrszul
Collaborator

This is noticeable by comparing runtime on sparse vs dense synthetic regression datasets.
The sparse ones run much slower although intuitively they should run faster.

@piotrszul
Collaborator Author

This can be observed, for example, on the sparse synthetic datasets such as src/test/data/synth/synth_2000_500_fact_10_0.995-wide.csv

The reason seems to be that very sparse data result in very deep and unbalanced trees (with, for example, 104 levels rather than 8 for dense data).
Because of the sparsity there is almost always a split that separates the zeros from a few non-zero values.
This split is usually very uneven (that is, the zero side has significantly more elements). On the next level the zero side is likely to be split again in the same manner on a different variable.
This results in a progression of one-sided splits that cut off a small portion of non-zero samples at each level, going very deep.

For example (numbers are sample-set sizes; final splits are marked with *):
[1000]
[995,5]
[990,5][5*]
[985,5][5*][5*]
....
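To get a rough feel for how deep this goes, here is a minimal sketch that just counts levels. It assumes every split peels off a fixed number of samples from the large zero-side node, which is a simplification of the pattern above; the function names are made up for illustration and are not part of the project. Only the two children of the node being split are listed per level, not the leaves finalised earlier.

```python
def one_sided_split_levels(n_samples, peel_size):
    """Sizes of the two children at each level when every split removes
    `peel_size` samples from the remaining large node (hypothetical model)."""
    levels = [[n_samples]]
    remaining = n_samples
    while remaining > peel_size:
        remaining -= peel_size
        levels.append([remaining, peel_size])
    return levels


def balanced_depth(n_samples, min_leaf):
    """Depth of an idealised tree that halves every node until nodes
    hold at most `min_leaf` samples (for comparison only)."""
    depth = 0
    size = n_samples
    while size > min_leaf:
        size = (size + 1) // 2  # size of the larger half
        depth += 1
    return depth


if __name__ == "__main__":
    levels = one_sided_split_levels(1000, 5)
    print("first levels:", levels[:4])
    print("one-sided depth:", len(levels) - 1)
    print("balanced depth:", balanced_depth(1000, 5))
```

Under these assumptions the one-sided depth grows linearly with the number of samples, while the balanced depth grows only logarithmically, which is consistent with the slowdown described here.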

I think this may also be the case for genomic variant data (as they are very likely to be sparse, especially if not filtered for MAF).

I am not sure what the impact on the importance is, but it may definitely adversely impact runtime performance.
It may be beneficial to limit the depth of the tree, possibly by requiring a minimum Gini gain before a split is considered.
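A rough sketch of what such a minimum-gain check could look like, assuming a classification setting with per-class sample counts on each side of a candidate split. The names `gini_gain`, `accept_split` and `min_gain` are hypothetical and do not refer to existing options in the code base.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_gain(left_counts, right_counts):
    """Impurity decrease obtained by splitting the parent into left/right."""
    parent = [l + r for l, r in zip(left_counts, right_counts)]
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    weighted = (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)
    return gini(parent) - weighted


def accept_split(left_counts, right_counts, min_gain):
    """Hypothetical stopping rule: keep the split only if it gains enough."""
    return gini_gain(left_counts, right_counts) >= min_gain


if __name__ == "__main__":
    # A 995/5 split that barely changes the class mix on either side.
    print(gini_gain([497, 498], [3, 2]))
    print(accept_split([497, 498], [3, 2], min_gain=0.01))
```

A very uneven split whose small side has roughly the same class mix as the parent yields a gain close to zero, so a modest threshold would stop the chain of one-sided splits early. A small side that is pure can still produce a noticeable gain, so the threshold would need tuning on real data.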
