Survival models #71

Closed
yunwezhang opened this issue Apr 20, 2021 · 8 comments
Labels
feature request New feature or request

Comments

@yunwezhang

Hi maintainer,

I am wondering whether it is possible to cascade a random survival forest (perhaps an sksurv model) instead of RF in your deep forest model. As in #48, it seems that the supported model types are classification and regression (or did I miss part of the tutorial docs?).

Thanks.

@xuyxu
Member

xuyxu commented Apr 20, 2021

Hi @yunwezhang, after walking through the example on random survival forest in sksurv, I think the biggest problem with using deep forest in survival analysis tasks is how to design good augmented features. In survival analysis, our main concern is the survival prediction function that takes time steps t as the input, right? For now, I cannot figure out how to incorporate this into the cascade structure of deep forest.

Since we are not very familiar with survival analysis, your suggestions would be very welcome ;-)

EDIT: We are happy to work on this feature request if this is achievable.

@xuyxu xuyxu added the feature request New feature or request label Apr 20, 2021
@yunwezhang
Author

Hi Yixuan,

Yes, you are right about the time steps. The input to a survival model requires a two-dimensional outcome (time plus a binary status indicating whether the observation is censored), but the output is usually a one-dimensional vector, either a "risk" or a "probability" (as in binary classification).

As for the augmented feature steps, I assume you are talking about this part of the model structure?
[image]
Does this part correspond to this part in the paper?
[image]
If I understand correctly, in the cascade forest part the augmented features (the in-model feature transformation) obtained from each forest are the predicted vectors, which can also be obtained from a survival forest (the output survival probability). However, I am not clear about the part shown in the attached picture. (I think the 2019 paper has it because it works better for image data.)

Thank you for looking into it. I am not sure how hard it is to add the random survival model, but I am happy to chat with you to see how it goes. In summary, the input data needs to change to X (n by p) and y (both time and status), and the output is a probability vector (could be survival risk, 1-year survival probability, 2-year survival probability, etc.) 😊
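
To make the data layout concrete, here is a minimal sketch in plain Python. It is not sksurv or deep forest code. It assumes y is stored as (time, status) pairs, with status True for an observed event and False for a censored one, and it uses a Kaplan-Meier estimate as a simple stand-in for a survival forest to show what "1-year" and "2-year survival probability" outputs could look like. The function names and the toy data are hypothetical.

```python
from typing import List, Tuple

# Hypothetical toy outcome: (time in years, status).
# status True = event observed, False = censored.
y: List[Tuple[float, bool]] = [
    (0.5, True), (1.0, False), (1.5, True), (2.5, True), (3.0, False),
]

def kaplan_meier(y):
    """Return the steps [(event_time, S(t))] of the Kaplan-Meier estimate."""
    steps = []
    surv = 1.0
    # Only observed event times change the estimate.
    for t in sorted({time for time, status in y if status}):
        at_risk = sum(1 for time, _ in y if time >= t)
        events = sum(1 for time, status in y if status and time == t)
        surv *= 1.0 - events / at_risk
        steps.append((t, surv))
    return steps

def survival_at(steps, horizon):
    """Evaluate the step function S(t) at a given horizon."""
    prob = 1.0  # S(t) = 1 before the first event
    for t, s in steps:
        if t <= horizon:
            prob = s
        else:
            break
    return prob

steps = kaplan_meier(y)
one_year = survival_at(steps, 1.0)   # survival probability at 1 year
two_year = survival_at(steps, 2.0)   # survival probability at 2 years
```

Censored samples still count in the at-risk set until their censoring time, which is why the status flag has to travel together with the time.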

@xuyxu
Member

xuyxu commented Apr 21, 2021

Thanks for your kind explanations @yunwezhang.

> As for the augmented feature steps, i assume you are talking about this part in the model structure?
> Is this part corresponding to this part in the paper?

No, the second figure you posted shows the multi-grained scanning part, which is not included in this package, since tree ensembles are typically not the best choice for structured data such as images or audio. Augmented features are part of the input to the hidden cascade layers. For classification, they are predicted class probabilities; for regression, they are predicted target values.
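
As a rough illustration of the augmented-feature idea (not the package's actual implementation), the sketch below concatenates each sample's original features with the outputs of the forests in the previous layer. The helper `augment` and the toy numbers are hypothetical.

```python
def augment(X, layer_outputs):
    """Concatenate original features with per-forest predictions, row by row.

    X: list of n feature rows.
    layer_outputs: list of per-forest outputs, each a list of n prediction
    rows (class probabilities for classification, target values for
    regression).
    """
    augmented = []
    for i, row in enumerate(X):
        new_row = list(row)
        for forest_out in layer_outputs:
            new_row.extend(forest_out[i])
        augmented.append(new_row)
    return augmented

# Two samples with two raw features each.
X = [[0.2, 1.0], [0.7, 3.0]]
# Hypothetical class probabilities from two forests in the previous layer.
forest_a = [[0.9, 0.1], [0.3, 0.7]]
forest_b = [[0.8, 0.2], [0.4, 0.6]]

# Input for the next cascade layer: 2 raw + 2 * 2 predicted columns.
X_next = augment(X, [forest_a, forest_b])
```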

Here are three questions that I would like to ask further.

  • Is random survival forest the state-of-the-art model for survival analysis?
  • To summarize your explanation, what we need to modify is the input format, right? Apart from X, we also need to pass in an array containing time and status.
  • Can we use the risk score as the augmented features?

@xuyxu xuyxu mentioned this issue Apr 21, 2021
13 tasks
@yunwezhang
Author

yunwezhang commented Apr 21, 2021

Hi Yixuan,

Thanks for the fast reply. I am aware that multi-grained scanning is not included, and that is why I asked why you have that part (the first figure) in your model structure instead of starting from the cascade forest.

Answer for the further questions:

  1. To me, yes, RSF is. Some DNN-based survival models are also considered state of the art, but RSF had the best overall performance on the several datasets I tried.
  2. Yes, only the input part needs to change (not the feature matrix X_train but the response y_train), because the output of the RSF is one-dimensional: the probability.
  3. I think both the risk score and the predicted probability can be used as augmented features. (My experience with RSF is in R, but in theory both packages are based on the same paper, so there should be no difference.) From my reading of that package, though, the predict function does not return the survival probability.
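
To illustrate point 3, here is a minimal sketch of turning one sample's predicted survival curve into augmented features. It is plain Python, not an sksurv or deep forest API. The function name is hypothetical, and the risk score used here (the sum of the cumulative hazard H(t) = -log S(t) over the curve's time points) is only one possible definition, assumed for the example.

```python
import math

def survival_features(surv_curve, horizons):
    """Turn one sample's predicted survival curve into augmented features.

    surv_curve: list of (time, S(time)) pairs sorted by time (step function).
    horizons: time points at which to read off survival probabilities.
    Returns the probabilities followed by a single risk score.
    """
    probs = []
    for h in horizons:
        prob = 1.0  # S(t) = 1 before the first step
        for t, s in surv_curve:
            if t <= h:
                prob = s
        probs.append(prob)
    # Assumed risk score: sum of the cumulative hazard H(t) = -log S(t)
    # over the curve's time points; higher means higher risk.
    risk = sum(-math.log(max(s, 1e-12)) for _, s in surv_curve)
    return probs + [risk]

# Hypothetical predicted curve for one sample.
curve = [(1.0, 0.9), (2.0, 0.6), (3.0, 0.5)]
features = survival_features(curve, horizons=[1.0, 2.0])
```

Either part of the returned vector (the probabilities at fixed horizons or the single risk score) could be appended to X for the next cascade layer.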

@xuyxu
Member

xuyxu commented Apr 21, 2021

> Thanks for the fast reply. I am aware that the multi-grain scanning is not included and that's why I asked why do you have the part (first figure) in your model structure instead of starting from the cascade forest.

The binner in that figure is used to reduce the number of splitting candidates for the sake of acceleration (not used in the original deep forest model). The entire architecture does correspond to the cascade forest structure.
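
For readers unfamiliar with binning, the following is a minimal sketch of the general idea, not the package's actual binner: continuous feature values are mapped to a small number of bin indices, so a tree only has to consider the bin boundaries as split candidates instead of every distinct value. The function names and the rank-based choice of cut points are assumptions made for the example.

```python
from bisect import bisect_right

def fit_bin_edges(values, n_bins):
    """Pick up to n_bins - 1 cut points at evenly spaced ranks of the data."""
    ordered = sorted(values)
    edges = []
    for k in range(1, n_bins):
        edge = ordered[(k * len(ordered)) // n_bins]
        # Skip duplicate cut points so the edges stay strictly increasing.
        if not edges or edge > edges[-1]:
            edges.append(edge)
    return edges

def transform(values, edges):
    """Replace each value with the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]

feature = [0.1, 0.4, 0.35, 0.8, 0.95, 0.5, 0.6, 0.2]
edges = fit_bin_edges(feature, n_bins=4)   # 3 cut points instead of 8 values
binned = transform(feature, edges)          # small integer codes per sample
```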

Besides, I have opened a feature request in sksurv (link). Deep forest could benefit from using a mixture of RandomSurvivalForest and ExtraSurvivalTrees in cascade layers. Let's wait for the response from the sksurv maintainers before formally working on this feature request ;-)

@yunwezhang
Author

Got it!
Yes, let's wait for the reply. To have that extra injection of randomness, it would be better to have ExtraSurvivalTrees.

@xuyxu
Member

xuyxu commented Apr 23, 2021

Realizing that we can implement ExtraSurvivalTrees by importing sksurv as a soft dependency, I think we could work on this feature request without extra help from that community.
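
A soft dependency is usually handled by importing the optional package lazily and failing with a clear message only when the feature is actually used. The sketch below shows one way this could look. `optional_import` and `load_survival_backend` are hypothetical helpers, not functions from deep forest or sksurv.

```python
import importlib

def optional_import(module_name, feature):
    """Import a module only when needed, with a clear error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional package '{module_name}'. "
            "Install it to enable this feature."
        ) from exc

def load_survival_backend():
    """Hypothetical hook: resolve sksurv lazily, not at package import time."""
    return optional_import("sksurv", feature="Survival forests")
```

With this pattern, users who only need classification or regression never have to install the survival package.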

> Thank you for looking into it and I am not sure how hard it is to add the random survival model. I am happy to chat with you to see how it goes. In summary, the change for the input data needs to be X (n by p), y (both time and status) and the output is probability vector (could be survival risk, 1 year survival probability, 2 year survival prob, etc.) 😊

If you are interested in extending deep forest to the field of survival analysis, could you contact me by e-mail (Address) so that we can discuss this further before opening a draft PR on this feature ;-)

@xuyxu
Member

xuyxu commented Apr 29, 2021

Closed via #14.

@xuyxu xuyxu closed this as completed Apr 29, 2021