
vsl.ml: Random Forest #126

Open

ulises-jeremias opened this issue Dec 18, 2022 · 16 comments

Labels: Hacktoberfest (assigned to any issue that is good to go for any Hacktoberfest participant)

Comments

@ulises-jeremias (Member)

ulises-jeremias commented Dec 18, 2022

Describe the feature

We want to create a new model in vsl.ml for classification using the Random Forest algorithm. The model should implement the following interface:

[heap]
pub struct RandomForest {
mut:
	name              string     // name of this "observer"
	data              &Data[f64] // x data
	stat              &Stat[f64] // statistics about x (data)
	min_samples_split int
	max_depth         int
}

and the following methods:

	name() string
mut:
	update() // called by Data when it changes
	train()
	predict(x [][]f64) []f64
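
For illustration, train and predict could fit together roughly as sketched below. The trees and n_trees fields and the grow_tree/tree_predict helpers are placeholders for the sketch, not part of the interface above:

pub fn (mut rf RandomForest) train() {
	rf.trees.clear()
	for _ in 0 .. rf.n_trees {
		// each tree is grown on its own bootstrap sample of rf.data
		rf.trees << grow_tree(rf.data, rf.min_samples_split, rf.max_depth)
	}
}

pub fn (rf &RandomForest) predict(x [][]f64) []f64 {
	mut result := []f64{len: x.len}
	for i, row in x {
		// majority vote over the per-tree class predictions
		mut votes := map[int]int{}
		for tree in rf.trees {
			votes[tree_predict(tree, row)]++
		}
		mut best_label := 0
		mut best_count := -1
		for label, count in votes {
			if count > best_count {
				best_label = label
				best_count = count
			}
		}
		result[i] = f64(best_label)
	}
	return result
}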

Use Case

Proposed Solution

Other Information

Acknowledgements

  • I may be able to implement this feature request
  • This feature might incur a breaking change

Version used

Environment details (OS name and version, etc.)

@BMJHayward

I've been making some progress on this in my own fork.
Not sure how to "share" the data object between the random forest &Data[f64] and individual trees yet though.

@ulises-jeremias (Member Author)

@BMJHayward feel free to send me a Draft Pull Request or the link to the branch where you have your changes so I can suggest how to do it 😊

@BMJHayward

Thanks @ulises-jeremias, here's my current branch, not quite ready for a PR:
https://github.com/BMJHayward/vsl/tree/126_implement_random_forest

I thought I'd make a proper decision tree implementation and use that in the RF as well, but using the Data interface makes it tricky, as I can't just use Data.x and Data.y. Or can I? Each tree uses several different indexes.

@ulises-jeremias (Member Author)

Yeah, I think that is not enough. Data.x and Data.y should be used only to replace the data you were receiving here. You will probably need another struct, or multiple instances of Data, but the latter is probably less efficient.

@ulises-jeremias (Member Author)

ulises-jeremias commented Jan 17, 2023

I just updated master, adding some methods.

There are two ways you can do this now:

  • creating a new instance of Data that shares the reference to x and gets the new y:

mut new_data := data.clone_with_same_x()
new_data.set_y(new_index_y)?

  • or, if you want x to be a new instance of la.Matrix, creating multiple instances of Data with data.clone() and then setting y with the new index:

mut data_with_new_index := data.clone()
data_with_new_index.set_y(new_index_y)?
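
For illustration, the first variant could give each tree its own bootstrap sample, roughly as in the sketch below. build_tree_datasets and n_trees are placeholders, and direct access to data.y and data.nb_samples is assumed; only clone_with_same_x and set_y come from the actual API:

import rand
import vsl.ml

// Sketch: one Data clone per tree, each re-labelled with a bootstrap
// resample of the targets drawn with replacement.
fn build_tree_datasets(data &ml.Data[f64], n_trees int) []&ml.Data[f64] {
	mut datasets := []&ml.Data[f64]{cap: n_trees}
	for _ in 0 .. n_trees {
		mut new_y := []f64{len: data.nb_samples}
		for i in 0 .. data.nb_samples {
			j := rand.intn(data.nb_samples) or { 0 } // row index sampled with replacement
			new_y[i] = data.y[j]
		}
		mut d := data.clone_with_same_x()
		d.set_y(new_y) or { panic(err) }
		datasets << d
	}
	return datasets
}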

@ulises-jeremias (Member Author)

@BMJHayward ^^

@BMJHayward

> I just updated master, adding some methods.
>
> There are two ways you can do this now:
>
> * creating a new instance of `Data` that shares the reference to `x` and gets the new `y`:
>
>     mut new_data := data.clone_with_same_x()
>     new_data.set_y(new_index_y)?
>
> * or, if you want `x` to be a new instance of `la.Matrix`, creating multiple instances of `Data` with `data.clone()` and then setting `y` with the new index:
>
>     mut data_with_new_index := data.clone()
>     data_with_new_index.set_y(new_index_y)?

Excellent, thank you. I'll take a look over the weekend.

@ulises-jeremias (Member Author)

@BMJHayward hey! Did that work? Is there anything else I can do to help?

@BMJHayward

@ulises-jeremias hi, thanks for following up on this. The linchpin for me is ml.tree.grow_tree. At line 155 the tree is "grown" by randomly selecting columns (or rather their indexes) and splitting based on them. The rand module samples without replacement, so I think I need an option there to sample with replacement, e.g. a tree could use columns [1, 2, 3, 1, 2, 3] and it would be perfectly legitimate to use them twice.

I can't figure out how to do this yet and maintain a consistent interface using Data like the rest of VSL. I'm sure there's a good way, and maybe calling set_y multiple times for each tree will be ok.
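
For illustration, a minimal standalone sketch of sampling column indexes with replacement using the rand module (not tied to the existing grow_tree signature; assumes the current rand.intn, which returns a result):

import rand

// draw n_select column indexes with replacement from n_cols columns,
// so a draw like [1, 2, 3, 1, 2, 3] is perfectly legitimate
fn sample_columns_with_replacement(n_cols int, n_select int) []int {
	mut cols := []int{len: n_select}
	for i in 0 .. n_select {
		cols[i] = rand.intn(n_cols) or { 0 }
	}
	return cols
}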

I'm also busy with family and renovations on the house at the moment, so it might be better if someone takes this on and I can consult or something. I'm happy to do it, it just won't be quick.

@dumblob

dumblob commented Jan 31, 2023

Was just looking for Cox Regression and Random Forest in VSL, which brought me here.

I wonder if there are any plans for Cox Regression and perhaps a few others from https://github.com/shankarpandala/lazypredict .

Also, it seems VSL does not yet support a "stop & resume" operation, which is acutely needed for fully automated "checkpointing to HDD & recovery from HDD" in long-running apps (which often fail due to full memory, stalls, etc. and need to be restarted, paying the tens of hours of identical computation again and again...).

Any plans for such a "stop & resume" API?

Of course, it has to be weighed against performance, so maybe it could be tied to time: approximately every 10 seconds by default, the computation would be interrupted and saved to a user-defined location. IDK.
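
For illustration, a minimal sketch of what such a time-based trigger could look like in V (Checkpointer and the save_checkpoint callback are hypothetical; nothing like this exists in VSL today):

import time

// call maybe_checkpoint from inside the training loop; it invokes the
// user-supplied save callback at most once per interval
struct Checkpointer {
mut:
	interval time.Duration = 10 * time.second
	last     time.Time
}

fn (mut c Checkpointer) maybe_checkpoint(save_checkpoint fn ()) {
	now := time.now()
	if now - c.last >= c.interval {
		save_checkpoint() // e.g. dump model state to a user-defined location
		c.last = now
	}
}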

@ulises-jeremias (Member Author)

> @ulises-jeremias hi, thanks for following up on this. The linchpin for me is ml.tree.grow_tree. At line 155 the tree is "grown" by randomly selecting columns (or rather their indexes) and splitting based on them. The rand module samples without replacement, so I think I need an option there to sample with replacement, e.g. a tree could use columns [1, 2, 3, 1, 2, 3] and it would be perfectly legitimate to use them twice.
>
> I can't figure out how to do this yet and maintain a consistent interface using Data like the rest of VSL. I'm sure there's a good way, and maybe calling set_y multiple times for each tree will be ok.
>
> I'm also busy with family and renovations on the house at the moment, so it might be better if someone takes this on and I can consult or something. I'm happy to do it, it just won't be quick.

hey! Don't rush with it. Family is more important 😊

About the question, I think calling set_y multiple times is OK as long as the .clone() method is used 👌🏻

@ulises-jeremias (Member Author)

> Was just looking for Cox Regression and Random Forest in VSL, which brought me here.
>
> I wonder if there are any plans for Cox Regression and perhaps a few others from https://github.com/shankarpandala/lazypredict .
>
> Also, it seems VSL does not yet support a "stop & resume" operation, which is acutely needed for fully automated "checkpointing to HDD & recovery from HDD" in long-running apps (which often fail due to full memory, stalls, etc. and need to be restarted, paying the tens of hours of identical computation again and again...).
>
> Any plans for such a "stop & resume" API?
>
> Of course, it has to be weighed against performance, so maybe it could be tied to time: approximately every 10 seconds by default, the computation would be interrupted and saved to a user-defined location. IDK.

lazypredict is great! We will probably add more models over time 👌🏻

Regarding the checkpointing, I hadn't thought about it. We can probably add it in the near future. I'll think about it and try to figure out the best way to do it, probably creating .h5 files every so many iterations.

@dumblob

dumblob commented Feb 1, 2023

> Regarding the checkpointing, I hadn't thought about it. We can probably add it in the near future. I'll think about it and try to figure out the best way to do it, probably creating .h5 files every so many iterations.

Yep, .h5 is fine. Maybe, to avoid slowing down the computation, we could just fork the process (i.e. delegate COW of all the structs with data to the operating system, as e.g. Redis does), which takes negligible time, and then save it to disk. The data might easily be hundreds of MB or more, so not doing it fully in parallel could slow down the computation too much (and V's threading support is probably not enough, as it would involve memcpy(), which would definitely be much slower than the COW over pages the operating system maintains under the hood). Just a thought.
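
For illustration, a minimal sketch of the fork-then-save idea in V (POSIX-only; write_h5_snapshot is a placeholder, VSL exposes no such call today):

import os

// the child inherits a copy-on-write view of the model state and writes the
// checkpoint, while the parent keeps computing without waiting for the I/O
fn checkpoint_in_child(write_h5_snapshot fn ()) {
	pid := os.fork()
	if pid == 0 {
		write_h5_snapshot() // sees a frozen snapshot thanks to COW pages
		exit(0)
	}
	// parent: continue immediately; the child can be reaped later if desired
}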

@dumblob

dumblob commented Mar 22, 2023

I wonder if there is any news regarding Cox Regression, Random Forest, and .h5 checkpointing. I could not find anything in the commits.

But no pressure, I just want to regularly get up to date 😉.

@dumblob

dumblob commented Jul 31, 2023

Any news? Especially the checkpointing seems highly beneficial to everybody (compared to Cox Regression and Random Forest).

ulises-jeremias added the Hacktoberfest label on Oct 7, 2023
@dumblob

dumblob commented Feb 10, 2024

Still interested in this, so I can start recommending V (VSL) within my bubble 😉.
