Train/Validate/Test Clarification #97

Open
lcaronson opened this issue Aug 30, 2021 · 6 comments
Comments

@lcaronson

Hello,

I just have a couple of clarifying questions to ask.

I am a little unclear on how you came up with the final 0.9544 Dice coefficient value in the published MIScnn paper. Is there some kind of additional function that can be used to compare the test data to predictions? Or is that value returned during the cross-validation phase?

If the cross-validation phase is also doing the testing of the data, then how do we define the ratio of train/validate/test data? For example, my understanding is that, of the roughly 300 studies in the KiTS19 dataset, you used an 80/90/40 split? I am just trying to figure out how you set these parameters in the code.

As a final question for you, if I have a dataset of 60 studies, would a decent train/validate/test ratio be 30/15/15?

@muellerdo
Member

Hey @lcaronson,

Thanks for your interest in using MIScnn!

> I am a little unclear on how you came up with the final 0.9544 Dice coefficient value in the published MIScnn paper. Is there some kind of additional function that can be used to compare the test data to predictions? Or is that value returned during the cross-validation phase?

The DSC of 0.9544 for the kidney segmentation was automatically computed with our cross-validation function (https://github.com/frankkramer-lab/MIScnn/blob/master/miscnn/evaluation/cross_validation.py). With default parameters (i.e. without any callbacks), no validation monitoring is performed, so the held-out fold of each cross-validation iteration can be used as the testing set.

However, you can always run a prediction call yourself and then compute the associated DSCs.
You can find an example for this approach in our CellTracking example with 2D microscopy images: https://github.com/frankkramer-lab/MIScnn/blob/master/examples/CellTracking.ipynb
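For reference, here is a minimal sketch in plain NumPy (not MIScnn's evaluation API) of how the DSC between a predicted mask and its ground truth can be computed after such a prediction call; the variable names are hypothetical:

```python
import numpy as np

def dice_coefficient(pred, truth, smooth=1e-7):
    """Dice similarity coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred).astype(bool)
    truth = np.asarray(truth).astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return (2.0 * intersection + smooth) / (pred.sum() + truth.sum() + smooth)

# Hypothetical usage after a prediction call:
# pred_mask = prediction > 0.5        # binarize the model output
# print(dice_coefficient(pred_mask, ground_truth_mask))
```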

> If the cross-validation phase is also doing the testing of the data, then how do we define the ratio of train/validate/test data? For example, my understanding is that, of the roughly 300 studies in the KiTS19 dataset, you used an 80/90/40 split? I am just trying to figure out how you set these parameters in the code.

For KiTS19, we didn't use any validation set and computed our scores purely on the testing sets from the 3-fold cross-validation. We also only used a subset of 120 samples -> 3x (80 train & 40 test).
We did this to demonstrate a default approach without any more advanced validation monitoring techniques.
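For illustration, a small sketch with scikit-learn (not MIScnn's internal splitting code) of how a 3-fold cross-validation over 120 samples yields 80 training and 40 testing samples per fold; the sample IDs are hypothetical:

```python
from sklearn.model_selection import KFold

sample_ids = [f"case_{i:05d}" for i in range(120)]   # hypothetical KiTS19 subset
kfold = KFold(n_splits=3, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(kfold.split(sample_ids)):
    print(f"Fold {fold}: {len(train_idx)} train / {len(test_idx)} test")
    # -> Fold 0: 80 train / 40 test (and likewise for folds 1 and 2)
```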

However, in our more recent COVID-19 segmentation study based on limited data, we used a cross-validation (train/val) plus hold-out testing strategy.
https://www.sciencedirect.com/science/article/pii/S2352914821001660?via%3Dihub

In this study, we performed a 5-fold cross-validation on only 20 samples, which resulted in 5 models (each fold returning one model).
Then, we computed predictions on a completely separate hold-out set of 100 samples (from another source). For each sample, we computed 5 predictions (one from each fold model) and then averaged these 5 predictions into a single one (= ensemble learning). Afterwards, we computed the DSC on the ensembled/final predictions.
In the paper, we also ran some additional experiments to show that this ensemble learning strategy is highly effective and that MIScnn is capable of producing robust models with it, even from as few as 20 samples.
Here is the complete COVID-19 study code: https://github.com/frankkramer-lab/covid19.MIScnn
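To make the averaging step concrete, here is a hedged sketch in plain NumPy (not the actual study code linked above); `fold_models` and their `predict` method are hypothetical placeholders for the five trained fold models:

```python
import numpy as np

def ensemble_predict(fold_models, image, threshold=0.5):
    """Average the predictions of all fold models pixel-wise (mean),
    then binarize the result into a single final segmentation mask."""
    # Each (hypothetical) fold model returns a soft/probability mask
    # with the same spatial shape as `image`.
    soft_preds = np.stack([model.predict(image) for model in fold_models], axis=0)
    mean_pred = soft_preds.mean(axis=0)          # pixel-wise mean over the 5 models
    return (mean_pred >= threshold).astype(np.uint8)

# final_mask = ensemble_predict(fold_models, holdout_image)
# The DSC is then computed on this ensembled prediction vs. the ground truth.
```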

> As a final question for you, if I have a dataset of 60 studies, would a decent train/validate/test ratio be 30/15/15?

Sadly, there is no clear answer to this question. Personally, I would highly recommend an 80/20 split into train/test, then running a 3-fold or 5-fold cross-validation on the 80% training data and utilizing ensemble learning techniques for testing. This is a state-of-the-art approach and will yield strong performance.
Otherwise, I'm personally a fan of a 65/15/20 split for train/val/test. It depends heavily on how much data you have. 60 samples is quite low for neural networks (even if it is a very good dataset from a medical perspective, given the complexity of generating annotated medical imaging datasets!), which is why I'm a big fan of cross-validation.
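To make this recommendation concrete, here is a hedged sketch with scikit-learn (hypothetical sample IDs, not MIScnn code) of an 80/20 train/test split followed by a 5-fold cross-validation on the training portion:

```python
from sklearn.model_selection import KFold, train_test_split

sample_ids = [f"study_{i:03d}" for i in range(60)]   # hypothetical 60 studies

# 80/20 split into a training pool and a hold-out test set (48/12 samples here)
train_ids, test_ids = train_test_split(sample_ids, test_size=0.2, random_state=0)

# 5-fold cross-validation on the training pool; each fold yields one model,
# and the resulting 5 models are later ensembled on `test_ids`.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr_idx, val_idx) in enumerate(kfold.split(train_ids)):
    fold_train = [train_ids[i] for i in tr_idx]
    fold_val = [train_ids[i] for i in val_idx]
    print(f"Fold {fold}: {len(fold_train)} train / {len(fold_val)} val")
```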

Hope that I was able to give you some insights/feedback! :)

Cheers,
Dominik

@muellerdo muellerdo self-assigned this Sep 8, 2021
@muellerdo muellerdo added the question (Further information is requested) label Sep 8, 2021
@emmanuel-nwogu
Copy link

> then averaged these 5 predictions into a single one

Hi, how did you average these predictions into one? Did you just average the metrics computed from the 5 predictions for each sample?

@muellerdo
Copy link
Member

Hey @emmanuel-nwogu,

Correct. In this study, we just averaged the predictions pixel-wise via the mean.

Cheers,
Dominik

@emmanuel-nwogu
Copy link

Thanks for the reply. From my understanding, you average the predicted binary masks to generate a final prediction mask. Is there a common name for this in the literature?

@muellerdo
Copy link
Member

Happy to help! :)

Absolutely correct!

Sadly, to my knowledge, there is no community-accepted name for the functions that combine predictions originating from ensemble learning.
Most of the time when you read about ensemble learning in biomedical image classification or segmentation, the authors apply averaging via the mean and simply call it averaging.

Last year, we published an experimental analysis of ensemble learning in medical image classification, in which I referred to the combination methods for merging multiple predictions as pooling functions, and to the averaging as mean pooling (which can be either unweighted or weighted).
Check out here: https://ieeexplore.ieee.org/document/9794729/
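As a small illustration of such pooling functions (a hedged sketch, not the code from the linked paper), unweighted vs. weighted mean pooling over per-model prediction arrays could look like this:

```python
import numpy as np

def mean_pooling(predictions, weights=None):
    """Combine ensemble-member predictions via an (optionally weighted) mean.

    `predictions`: iterable of arrays with identical shape (one per model).
    `weights`: optional per-model weights, e.g. validation DSCs.
    """
    preds = np.stack(list(predictions), axis=0)          # shape: (n_models, ...)
    if weights is None:
        return preds.mean(axis=0)                        # unweighted mean pooling
    return np.average(preds, axis=0, weights=weights)    # weighted mean pooling

# Hypothetical usage: weight each fold model by its validation DSC
# combined = mean_pooling(model_probs, weights=[0.91, 0.93, 0.90, 0.94, 0.92])
```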

Hope this helps explain general ensemble learning in biomedical image analysis a little bit.

Best Regards,
Dominik

@emmanuel-nwogu
Copy link

Thanks, I'll check it out. :)
