num_batches_per_epoch in network_trainer and low gpu-util #61

Closed
Zakiyi opened this issue Sep 10, 2019 · 11 comments

Zakiyi commented Sep 10, 2019

Hello Fabian Isensee,
Thanks for sharing this, it's really awesome work. I cannot agree more that the U-Net is a really powerful architecture for medical image segmentation; I have tried several of the latest networks and modules from computer vision, and none of them could outperform a simple U-Net. Recently I spent several days reading and running your nnU-Net on KiTS2019, and a few questions confused me.
First, as I understand it, the number of plan stages is decided by whether the computed input size is larger than a fixed proportion of the median shape size. In that case, shouldn't architecture_input_voxels be plans['input_patch_size']? In your code it is a pre-defined size instead:

  # experiment_planner_baseline_3DUNet.py
  architecture_input_voxels = np.prod(generic_UNet.DEFAULT_PATCH_SIZE_3D)
  if np.prod(self.plans_per_stage[-1]['median_patient_size_in_voxels'], dtype=np.int64) / \
          architecture_input_voxels < HOW_MUCH_OF_A_PATIENT_MUST_THE_NETWORK_SEE_AT_STAGE0:
      more = False
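
To make the question concrete, this is roughly what I would have expected instead (just a sketch of my reading; the exact key that holds the planned patch size may be different):

  # sketch only: compare against the patch size actually planned for this stage,
  # not the fixed generic_UNet.DEFAULT_PATCH_SIZE_3D
  planned_patch_size = self.plans_per_stage[-1]['patch_size']  # or 'input_patch_size'?
  architecture_input_voxels = np.prod(planned_patch_size, dtype=np.int64)
  if np.prod(self.plans_per_stage[-1]['median_patient_size_in_voxels'], dtype=np.int64) / \
          architecture_input_voxels < HOW_MUCH_OF_A_PATIENT_MUST_THE_NETWORK_SEE_AT_STAGE0:
      more = False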

Second, during each training epoch the model is supposed to run over the entire dataset, but self.num_batches_per_epoch seems to be a fixed number:
       self.num_batches_per_epoch = 250
       self.num_val_batches_per_epoch = 50
If each call to self.run_iteration computes the loss on one batch (say batch size 2 or some other number), shouldn't self.num_batches_per_epoch change accordingly?

  for b in range(self.num_batches_per_epoch):
      l = self.run_iteration(self.tr_gen, True)
      train_losses_epoch.append(l)
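
For example, I would have expected something along these lines (purely hypothetical, the attribute names are my guesses):

  # purely hypothetical sketch: derive the number of batches from the dataset size
  # so that one epoch covers the training set roughly once
  num_training_cases = len(self.dataset_tr)  # attribute name is my guess
  self.num_batches_per_epoch = int(np.ceil(num_training_cases / self.batch_size))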

Third, I tried to run two folds in parallel on different GPUs. I have read the related issue and set MKL_NUM_THREADS=1, NUMEXPR_NUM_THREADS=1, OMP_NUM_THREADS=1 on the command line when running, but GPU utilization is still really low, most of the time 0%. I did not change the default num_threads for the training and validation batchgenerators. I really have no idea what is going on.
CPU:

CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Socket(s): 2
NUMA node(s): 2
Model name: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
NUMA node0 CPU(s): 0-13,28-41
NUMA node1 CPU(s): 14-27,42-55

GPU:

| 0 GeForce GTX 108... Off | 00000000:02:00.0 On | N/A |
| 0% 52C P5 20W / 250W | 10529MiB / 11169MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:81:00.0 Off | N/A |
| 0% 49C P8 14W / 250W | 10981MiB / 11178MiB | 0% Default |

Your reply will be highly appreciated!
Many thanks,
zhenyu

@FabianIsensee (Member) commented:

Hi zhenyu,
thank you for your kind words. Let me try to answer your questions :-)

First, as I understand it, the number of plan stages is decided by whether the computed input size is larger than a fixed proportion of the median shape size. In that case, shouldn't architecture_input_voxels be plans['input_patch_size']? In your code it is a pre-defined size instead.

You are correct. This is a bug. Great work spotting it! I will fix it today.

About the num_batches_per_epoch thing: the whole concept of an epoch is kind of obsolete in patch-based training. You can never guarantee that the network sees all of the training data, because the patches are sampled randomly. My personal opinion is that the whole idea of really iterating over the entire dataset in each epoch is nonsense; you can just as well keep sampling examples randomly forever. This is simply the way I do it. You can adapt it to something else if you want, but I would not expect a performance improvement.
Also note that the duration of training can vary between datasets due to early stopping. This is my way of dealing with different numbers of training cases (smaller datasets converge faster and stop earlier).
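
To make this concrete, patch-based training essentially looks like this (a toy sketch, not the actual nnU-Net generator):

  import numpy as np

  def random_patch_generator(cases, patch_size, batch_size):
      # toy sketch: sample random patches forever, with no notion of an "epoch"
      while True:
          batch = []
          for _ in range(batch_size):
              case = cases[np.random.randint(len(cases))]   # pick a random training case
              starts = [np.random.randint(0, s - p + 1) for s, p in zip(case.shape, patch_size)]
              crop = tuple(slice(st, st + p) for st, p in zip(starts, patch_size))
              batch.append(case[crop])                      # random crop from that case
          yield np.stack(batch)

  # an "epoch" is then simply however many batches you decide to draw, e.g. 250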

Lastly, about your GPU issue: GPU usage should be above 95% on average, so there is definitely something weird going on. I need more information to be able to help. Most importantly, I need to know what your CPU usage is like (is the CPU maxed out or is it idling as well?) and what kind of storage you are using (HDD or SSD).

Best,
Fabian

@Zakiyi (Author) commented Sep 12, 2019

Hello Fabian,
Thank you for the reply. I now understand num_batches and epochs in the context of patch-based training; your way makes sense.
About the GPU utilization: I use an SSD connected via USB 3.0. When I train only one fold on one GPU, the GPU utilization stays low (most of the time 0%) during the first several epochs, and an epoch can take more than one hour for the 3D full-res U-Net with patch_size=192x192x48 and batch_size=2 on a GTX 1080, before gradually settling to a normal ~1000 s per epoch. The CPU is idling most of the time, so perhaps there is some problem with my SSD. Sorry for asking about this kind of problem :(
Also, I have two more questions. First, in network_architecture/generic_UNet.py, when constructing the ConvDropoutNormNonlin module: if the nonlin is fixed as LeakyReLU it is fine, but if one passes it in from nnUNetTrainer, the nonlin does not actually change in ConvDropoutNormNonlin. I am not sure whether this is a problem or not:

self.lrelu = nn.LeakyReLU(**self.nonlin_kwargs)
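
I would have expected something like the following instead (just my guess at the intended behaviour):

  # sketch of what I would expect: actually instantiate the nonlinearity that was passed in
  self.lrelu = self.nonlin(**self.nonlin_kwargs)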

Second, during training val_eval_criterion_MA can be based on either all_val_losses or val_eval_metrics. In your code val_eval_metrics seems to always stay empty, yet plot_progress() uses self.all_val_eval_metrics to plot the "evaluation metric" curve, which is a little strange to me:

  if len(self.all_val_eval_metrics) == len(self.all_val_losses):
      ax2.plot(x_values, self.all_val_eval_metrics, color='g', ls='--', label="evaluation metric")

many thanks,
zhenyu

@FabianIsensee (Member) commented Sep 12, 2019

Hi zhenyu,
you seem to dig really deep into the code. I like that. And THANK YOU, it's great to have somebody point out all my coding mishaps :-)

The self.lrelu = nn.LeakyReLU(**self.nonlin_kwargs) part is quite embarrassing, especially because I already fixed it in my internal repo and simply forgot to also update the code here. I will do that today.

Then, what is plotted is quite... well... confusing, I know. If self.run_online_evaluation is implemented, then self.all_val_eval_metrics is whatever the output of that is. If self.run_online_evaluation is not implemented, we don't plot anything.
val_eval_criterion_MA has nothing to do with plotting; it is used for epoch selection only. If self.run_online_evaluation is implemented, val_eval_criterion_MA is the moving average of that metric. If it is not implemented, we still need to somehow select epochs, so we fall back to using the validation loss for that.
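
Roughly, the selection logic boils down to something like this (a simplified sketch, not the exact code):

  # simplified sketch of the epoch-selection logic described above
  if len(self.all_val_eval_metrics) > 0:
      # run_online_evaluation is implemented: track a moving average of that metric
      new_value = self.all_val_eval_metrics[-1]
  else:
      # fall back to the (negated) validation loss so that "higher is better" still holds
      new_value = -self.all_val_losses[-1]

  if self.val_eval_criterion_MA is None:
      self.val_eval_criterion_MA = new_value
  else:
      alpha = self.val_eval_criterion_alpha  # smoothing factor
      self.val_eval_criterion_MA = alpha * self.val_eval_criterion_MA + (1 - alpha) * new_value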

Best,
Fabian

@Zakiyi (Author) commented Sep 12, 2019

Hello Fabian,

I found it 😂 :) during validation, run_online_evaluation = True. Thanks again for your work 👏 👍 💯.

best,
zhenyu

@FabianIsensee (Member) commented:

Nice, I forgot about that one :-D

About your speed issue: I really don't know what's going on. My guess is that the SSD is to blame; the USB interface is likely not fast enough to handle the data transfers. Please try to build the SSD directly into the computer (via a SATA or NVMe interface).
Best,
Fabian

@FabianIsensee (Member) commented:

(by not fast enough I mean the number of requests, not raw throughput)

@Zakiyi (Author) commented Sep 20, 2019

Hello Fabian,
These days I found another problem. I trained nnU-Net on KiTS2019, and although the max epoch is set to 1000, it stopped training at epoch 300 because the minimum learning rate was reached, which seems weird to me. During training I stopped and continued the training many times, and I am not sure whether this has any effect on the training process.

I have checked the code: when continuing training, the latest checkpoint is loaded and the model resumes from the previous state. However, self.train_loss_MA is not saved in the checkpoint, so the last training loss is used as its new starting value, whereas self.train_loss_MA is actually computed by accumulating all of the training losses. Since the lr_scheduler depends on self.train_loss_MA, I am not sure whether this affects the learning-rate updates and causes training to stop early. Intuitively, wouldn't it be better to save self.train_loss_MA in the checkpoint as well?

self.all_tr_losses, self.all_val_losses, self.all_val_losses_tr_mode, self.all_val_eval_metrics = saved_model['plot_stuff']
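
Something like this is what I had in mind (just a sketch on top of the existing save/load code; the surrounding checkpoint keys are from memory):

  # sketch only: also persist the moving average so that resuming does not reset it
  save_this = {
      'epoch': self.epoch,
      'state_dict': self.network.state_dict(),
      'optimizer_state_dict': self.optimizer.state_dict(),
      'plot_stuff': (self.all_tr_losses, self.all_val_losses,
                     self.all_val_losses_tr_mode, self.all_val_eval_metrics),
      'train_loss_MA': self.train_loss_MA,   # <-- the addition I am suggesting
  }
  torch.save(save_this, fname)

  # and when restoring:
  self.train_loss_MA = saved_model.get('train_loss_MA', None)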

Finally, about the speed issue: when I train in 3d_lowres mode (160, 160, 80), training is very fast; however, with 3d_fullres (192, 192, 48) it becomes very, very slow. I am not sure whether this has anything to do with do_dummy_aug, since self.do_dummy_aug=True in 3d_fullres while it is False in 3d_lowres.

best,
zhenyu

@FabianIsensee (Member) commented:

Hi,
why are you constantly restarting the training? I have not explored in depth the effect this has on performance. I have to restart trainings myself from time to time and have not observed it to be a problem.
self.train_loss_MA is indeed not saved, and maybe it should be. But I am not changing that now, because I am working on a new version of nnU-Net that handles the learning rate differently, so whatever I do would soon be obsolete.
Please also note that after loading, self.train_loss_MA will be initialized to the first training loss you get. After that the model has 30 epochs to improve upon it before anything is done to the learning rate, so it should not be too much of a problem.
I find it interesting that you have such a speed issue; I have never observed anything like this. The only idea I have is that you have some I/O bottleneck. The 3d_lowres data is much smaller, so your OS can probably cache all of it and thus hide the bottleneck. Please try to put the data on an SSD that is connected via SATA or M.2, not USB.
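
In simplified form, the mechanism is roughly this (a sketch from memory, not the exact code; the smoothing factor may differ):

  # sketch: an exponential moving average of the training loss drives the lr scheduler
  alpha = 0.93                                   # smoothing factor, value from memory
  if self.train_loss_MA is None:                 # e.g. right after resuming a training
      self.train_loss_MA = self.all_tr_losses[-1]
  else:
      self.train_loss_MA = alpha * self.train_loss_MA + (1 - alpha) * self.all_tr_losses[-1]

  # ReduceLROnPlateau-style scheduling on the moving average: the learning rate is only
  # reduced after ~30 epochs (the scheduler's patience) without improvement
  self.lr_scheduler.step(self.train_loss_MA)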
Best,
Fabian

@Zakiyi (Author) commented Sep 22, 2019

Hello Fabian,

Thank you for your reply. I constantly restart the training because I have no patience 😂 😂. And I now believe you are right that the way self.train_loss_MA is re-initialized when resuming is really not a problem.

Also, I have a question about how the validation result is computed. In your manuscript submitted to KiTS2019, were the kidney and tumor Dice scores (97.34 and 85.04) computed as a global Dice, as the mean Dice in summary.json, or according to the official KiTS challenge evaluation? Besides, I also found some mislabeled data, and I have seen your prediction results in the related issue. It seems you used the uninterpolated data for training? Have you used mirroring augmentation during training?

So sorry, I have too many questions 😂. Even after I modified the training data as you did, there is still a performance gap (~5% on the tumor Dice) compared with your plain 3D U-Net, so I am trying to learn as many details as possible; I hope this does not bother you. Thank you also for your patience with my previous questions.

many thanks,
zhenyu

@FabianIsensee (Member) commented:

were the kidney and tumor Dice scores (97.34 and 85.04) computed as a global Dice, as the mean Dice in summary.json, or according to the official KiTS challenge evaluation?

The same as the official evaluation

It seems you used the uninterpolated data for training

I interpolated all the data to some common voxel spacing as specified in the paper.

Have you used mirroring augmentation during training?

yes

So sorry, I have too many questions 😂. Even after I modified the training data as you did, there is still a performance gap (~5% on the tumor Dice) compared with your plain 3D U-Net, so I am trying to learn as many details as possible; I hope this does not bother you. Thank you also for your patience with my previous questions.

You will not get my 3D U-Net performance with the current version of nnU-Net :-) I have an improved version coming up at some point in the future; some variant of that version is what I ran in the KiTS challenge.

Best,
Fabian

@Zakiyi (Author) commented Sep 24, 2019

Got it, many thanks!!! 🎉

best,
zhenyu
