num_batches_per_epoch in network_trainer and low gpu-util #61
Comments
Hi zhenyu,
You are correct. This is a bug. Great work spotting it! I will fix it today. About the num_batches_per_epoch thing: the whole concept of an epoch is kind of obsolete in patch-based training. You can never guarantee that the network sees all of the training data because of the way the patches are sampled (randomly). My personal opinion is that the whole idea of really iterating over the entire dataset in each epoch is nonsense; you can just as well keep sampling examples randomly forever. This is simply the way I do it. You can adapt it to something else if you want, but I would not expect a performance improvement. Lastly, about your GPU issue: the GPU usage should be above 95% on average, so there is definitely something weird going on. I need more information about the issue to be able to help. Most importantly, I need to know what your CPU usage looks like (is the CPU maxed out or is it idling as well?) and also what kind of storage you are using (HDD or SSD). Best,
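For illustration, here is a minimal sketch of what such an "epoch" amounts to with random patch sampling (hypothetical names like run_epoch, train_step and val_step; this is not nnU-Net's actual trainer code): the epoch is just a fixed number of randomly drawn batches, independent of the dataset size.

```python
import numpy as np

# Minimal sketch (not nnU-Net's actual trainer): with random patch sampling,
# an "epoch" is a fixed number of randomly drawn batches, so its length has
# nothing to do with the number of training cases.
NUM_BATCHES_PER_EPOCH = 250
NUM_VAL_BATCHES_PER_EPOCH = 50

def run_epoch(train_loader, val_loader, train_step, val_step):
    train_losses, val_losses = [], []
    for _ in range(NUM_BATCHES_PER_EPOCH):
        batch = next(train_loader)              # randomly sampled patches
        train_losses.append(train_step(batch))  # one optimizer step
    for _ in range(NUM_VAL_BATCHES_PER_EPOCH):
        batch = next(val_loader)
        val_losses.append(val_step(batch))
    return float(np.mean(train_losses)), float(np.mean(val_losses))
```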
Hello Fabian,
Second, during training, val_eval_criterion_MA can be based on either all_val_losses or val_eval_metrics. In your code it seems val_eval_metrics always stays empty; however, plot_progress() uses self.all_val_eval_metrics to plot the training curve of the "evaluation metric", which is a little strange to me.
Many thanks,
Hi zhenyu, Then what is plotted is quite... well... confusing, I know. If self.run_online_evaluation is implemented, then self.all_val_eval_metrics is going to be whatever the output of that is. If self.run_online_evaluation is not implemented, we don't plot anything. Best,
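A minimal sketch of that behavior with hypothetical names (this is not the actual nnU-Net trainer): the evaluation-metric curve is only populated, and only plotted, when an online evaluation hook is provided.

```python
import matplotlib
matplotlib.use("Agg")  # render to file, no display required
import matplotlib.pyplot as plt

# Sketch of the conditional logic: without an online evaluation hook,
# all_val_eval_metrics stays empty and no metric curve is drawn.
class TrainerSketch:
    def __init__(self, run_online_evaluation=None):
        self.all_val_losses = []
        self.all_val_eval_metrics = []
        self.run_online_evaluation = run_online_evaluation  # optional callable

    def validate_epoch(self, val_outputs, val_loss):
        self.all_val_losses.append(val_loss)
        if self.run_online_evaluation is not None:
            # whatever the hook returns is what gets plotted as the "metric"
            self.all_val_eval_metrics.append(self.run_online_evaluation(val_outputs))

    def plot_progress(self, fname="progress.png"):
        plt.plot(self.all_val_losses, label="val loss")
        if self.all_val_eval_metrics:  # skip the metric curve if never computed
            plt.plot(self.all_val_eval_metrics, label="evaluation metric")
        plt.legend()
        plt.savefig(fname)
        plt.close()
```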
Hello Fabian, I found it 😂 :) . During validation, run_online_evaluation = True. Thanks for your work again 👏 👍 💯. Best,
Nice, I forgot about that one :-D About your speed issue: I really don't know what's going on. My guess is that the SSD is to blame. The USB interface is likely not fast enough to handle the data transfers. Please try to build the SSD directly into the computer (via SATA or NVMe interface).
(By "not fast enough" I mean the number of requests, not raw throughput.)
Hello Fabian, I have checked the code. When choosing to continue training, the latest checkpoint is loaded and the model is trained from the previous state; however, self.train_loss_MA is not saved in the checkpoint, and the last train loss is used as the new starting value. The problem is that self.train_loss_MA is actually computed by accumulating all of the training losses. Since the lr_scheduler depends on self.train_loss_MA, I am currently not sure whether this will affect the lr updates and cause training to stop early. Intuitively, wouldn't it be better to save self.train_loss_MA in the checkpoint as well?
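A minimal sketch of that suggestion, with hypothetical names and an assumed exponential-moving-average factor alpha (this is not nnU-Net's actual checkpointing code): keep the moving average of the training loss in the checkpoint so that resumed training feeds the lr_scheduler the same value it had before the interruption.

```python
import torch

# Sketch: persist train_loss_MA alongside model/optimizer state so continued
# training resumes with the same lr-scheduler input.
class TrainerSketch:
    def __init__(self, alpha=0.93):
        self.alpha = alpha                # smoothing factor of the moving average
        self.train_loss_MA = None
        self.all_train_losses = []

    def update_train_loss_ma(self, epoch_loss):
        self.all_train_losses.append(epoch_loss)
        if self.train_loss_MA is None:
            self.train_loss_MA = epoch_loss
        else:
            self.train_loss_MA = self.alpha * self.train_loss_MA + (1 - self.alpha) * epoch_loss

    def save_checkpoint(self, fname, model, optimizer):
        torch.save({
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "all_train_losses": self.all_train_losses,
            "train_loss_MA": self.train_loss_MA,  # value the lr_scheduler relies on
        }, fname)

    def load_checkpoint(self, fname, model, optimizer):
        ckpt = torch.load(fname)
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        self.all_train_losses = ckpt["all_train_losses"]
        self.train_loss_MA = ckpt["train_loss_MA"]
```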
Finally, about the speed issue: when I train my model in 3d_lowres mode (160, 160, 80), training is very fast. However, when training with 3d_fullres (192, 192, 48), it becomes very, very slow. I am not sure whether this has anything to do with do_dummy_aug, since self.do_dummy_aug=True in 3d_fullres while it is False in 3d_lowres. Best,
Hi,
Hello Fabian, Thank you for your reply. I constantly restart training because I have no patience 😂 😂. And I now believe you are right that loading self.train_loss_MA when resuming training is really not a problem. Also, I have a question about how the validation results are computed: in your manuscript submitted to KiTS2019, were the kidney and tumor Dice scores (97.34 and 85.04) computed as a global Dice, as the mean Dice in summary.json, or according to the official KiTS challenge evaluation? Besides, I also found some erroneously labeled data, and I have seen your prediction results in that issue. It seems you used the uninterpolated data for training? Have you used mirroring augmentation during training? Sorry for asking so many questions 😂; even after I modified the training data as you did, there is still a performance gap (~5% on tumor Dice) compared with your plain 3D U-Net, so I am trying to learn as many details as possible. I hope this does not bother you. Also, thank you for your patience with the previous kind replies. Many thanks,
The same as the official evaluation
I interpolated all the data to some common voxel spacing as specified in the paper.
yes
You will not get my 3D U-Net performance with the current version of nnU-Net :-) I have an improved version coming up at some point in the future; some variant of that version is what I ran in the KiTS challenge. Best,
Got it, great, thanks!!! 🎉 Best,
Hello Fabian Isensee
Thanks for sharing, it's really awesome work. I could not agree more that the U-Net is a really powerful architecture for medical image segmentation; I tried various recent networks and modules from computer vision, and none of them could outperform the simple U-Net. Recently, I spent several days reading and running your nnU-Net on KiTS2019, and a few questions have me confused:
First, according to my understanding, the number of plan stages is decided by whether the computed input size is larger than a fixed proportion of the median shape size. In that case, shouldn't architecture_input_voxels be plans['input_patch_size']? In your code it is a pre-defined size.
Second, I assumed that during each training epoch the model runs over the entire dataset. However, self.num_batches_per_epoch seems to be a fixed number:
self.num_batches_per_epoch = 250
self.num_val_batches_per_epoch = 50
If each call to self.run_iteration computes the loss on one batch (of batch_size, say 2 or some other number), shouldn't self.num_batches_per_epoch change accordingly?
Third, I tried to run two folds in parallel on different GPUs and have read the related issue. I set MKL_NUM_THREADS=1, NUMEXPR_NUM_THREADS=1, OMP_NUM_THREADS=1 on the command line when running. But the GPU utilization is still really low, most of the time it is 0. I didn't change the default num_threads in batchgenerator_train and val. I really have no idea what is going on.
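For reference, a minimal sketch of one way to pin those thread counts from inside a training script rather than on the command line (my own illustration, not part of nnU-Net; the environment variables themselves are the standard MKL/OpenMP/numexpr ones):

```python
import os

# Limit implicit CPU threading *before* importing numpy/torch, so that two
# training processes on different GPUs do not oversubscribe the CPU.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

import numpy as np   # noqa: E402  (imported after setting the thread limits)
import torch         # noqa: E402

torch.set_num_threads(1)  # also cap PyTorch's intra-op CPU threads
```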
[CPU and GPU utilization screenshots]
Your reply will be highly appreciated!
many thanks,
zhenyu