
Each fold is trained for a fixed number of epochs. There is no early stopping based on validation c-index (results wouldn't be fair). #7

liupei101 opened this issue May 13, 2024 · 4 comments


@liupei101

          Sorry for the late reply. Each fold is trained for a fixed number of epochs. There is no early stopping based on validation c-index (results wouldn't be fair).

Originally posted by @guillaumejaume in #4 (comment)

@liupei101
Author

Hello, Jaume! Thanks for your impressive work.

You mentioned that results obtained with early stopping would not be fair. May I ask whether there is evidence for this statement? I am looking for fair and universal ways to evaluate survival models, since I find that the final reported performance is sensitive to how we evaluate.

Concretely, when doing 5-fold cross-validation with a fixed number of epochs (T), just one more training epoch can lead to a significantly different final result; in other words, the performance measured with 5-fold cross-validation is often sensitive to T. To avoid fixing T, one could adopt early stopping when training each fold, but, as you mentioned, this would still not be fair. One possible reason, in my humble understanding, is that the validation set is too small to support reasonable early stopping. So, given the limited samples in computational pathology, what would be a fairer way to evaluate predictive models?

Looking forward to hearing from you.

@HuahuiYi


I'm also very interested in this issue. What would be the fairest way to handle it? It seems that neither a fixed number of epochs nor an early stopping mechanism based on a validation set may yield the most reliable results.

@liupei101
Author

For anyone else seeking fair ways to configure and evaluate survival analysis models, here are some facts that may be helpful.

  • Early stopping appears to be adopted frequently in the survival analysis community, e.g., in the representative MTLR [1] and DeepHit [2]. In these works, some datasets are of a similar scale to WSI-based survival analysis datasets (~1000 patients).
  • Discrete survival models are common, like those used in SurvPath, Patch-GCN, and MCAT. In the survival analysis community, setting the number of discrete times to the square root of the number of uncensored patients is often suggested, as stated in the JMLR paper [3] and the ICML paper [4]. In addition, although the prediction is discrete, the survival time label is still continuous in performance evaluation, e.g., in C-Index calculation.
  • SurvPath, Patch-GCN, and MCAT set the number of discrete times to 4 by default. Moreover, their performance metric, C-Index, is calculated using the survival time label after quantile discretization (a small sketch of these two points follows the references below).

[1] Yu, C.-N., Greiner, R., Lin, H.-C., and Baracos, V. Learning patient-specific cancer survival distributions as a sequence of dependent regressors. Advances in Neural Information Processing Systems, 24:1845–1853, 2011.
[2] Lee, C., Zame, W. R., Yoon, J., and van der Schaar, M. DeepHit: A deep learning approach to survival analysis with competing risks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[3] Haider, H., Hoehn, B., Davis, S., and Greiner, R. Effective ways to build and evaluate individual survival distributions. Journal of Machine Learning Research, 21(85):1–63, 2020.
[4] Qi, S. A., Kumar, N., Farrokh, M., Sun, W., Kuan, L. H., Ranganath, R., ... and Greiner, R. An effective meaningful way to evaluate survival models. In International Conference on Machine Learning, pp. 28244–28276. PMLR, 2023.

@HuahuiYi


Hi, Pei!
I have a question. If I am not mistaken, it seems that in SurvPath, during 5-fold cross-validation, the validation set = the test set. This differs from DeepHit, which splits the dataset into training and test sets in an 8:2 ratio and then performs 5-fold cross-validation on the training set. DeepHit's approach explicitly separates the test set and validation set, making the use of early stopping understandable. However, with SurvPath's data splitting method, how can we ensure that the test set remains unknown and is not leaked when using early stopping?

I believe the current data splitting method has issues. Another drawback is that some datasets have very small sample sizes, and perhaps a few-shot approach might be more appropriate.
