
DataParallel in test_set_evaluation.py #11

Open
giantke opened this issue Jul 29, 2023 · 2 comments
giantke commented Jul 29, 2023

Hi Tim,
Congratulations on the awesome work.
I followed the workflow to run test_set_evaluation.py for inference, which cost me an enormous amount of time (more than ten days). I therefore tried to run it on multiple GPUs by adding model = torch.nn.DataParallel(model, device_ids=[0, 1, 2, 3]) in the function get_model(), and it did seem to work: when I checked GPU utilization, all of the GPUs were busy.
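For reference, a minimal sketch of this kind of change, assuming get_model() builds and returns an nn.Module; build_full_model() and the device argument here are placeholders, not code from the repository:

```python
import torch

def get_model(device):
    # Placeholder for the model construction code that already exists in get_model().
    model = build_full_model()
    model.to(device)

    # DataParallel scatters each input batch along dim 0 across device_ids
    # and gathers the outputs back on the first listed device.
    if torch.cuda.device_count() > 1:
        model = torch.nn.DataParallel(model, device_ids=[0, 1, 2, 3])

    model.eval()
    return model
```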
However, I ran into a problem at the 56th iteration with batch_size=4, and also at the 14th iteration with batch_size=16. It cannot be a coincidence, since 56×4 = 14×16 = 224, so both point to the 224th sample. The error was reported as follows:
56it [00:56, 1.02s/it]
Traceback (most recent call last):
  File "/public/home/zhangke/rgrg-main/src/full_model/test_set_evaluation.py", line 906, in <module>
    main()
  File "/public/home/zhangke/rgrg-main/src/full_model/test_set_evaluation.py", line 902, in main
    evaluate_model_on_test_set(model, test_loader, test_2_loader, tokenizer)
  File "/public/home/zhangke/rgrg-main/src/full_model/test_set_evaluation.py", line 740, in evaluate_model_on_test_set
    obj_detector_scores, region_selection_scores, region_abnormal_scores = evaluate_obj_detector_and_binary_classifiers_on_test_set(model, test_loader, test_2_loader)
  File "/public/home/zhangke/rgrg-main/src/full_model/test_set_evaluation.py", line 714, in evaluate_obj_detector_and_binary_classifiers_on_test_set
    num_images = iterate_over_test_loader(test_2_loader, num_images, is_test_2_loader=True)
  File "/public/home/zhangke/rgrg-main/src/full_model/test_set_evaluation.py", line 631, in iterate_over_test_loader
    update_object_detector_metrics_test_loader_2(obj_detector_scores, detections, image_targets, class_detected)
  File "/public/home/zhangke/rgrg-main/src/full_model/test_set_evaluation.py", line 555, in update_object_detector_metrics_test_loader_2
    intersection_area_per_region_batch, union_area_per_region_batch = compute_intersection_and_union_area_per_region(detections, image_targets, class_detected)
  File "/public/home/zhangke/rgrg-main/src/full_model/test_set_evaluation.py", line 518, in compute_intersection_and_union_area_per_region
    x0_max = torch.maximum(pred_boxes[..., 0], gt_boxes[..., 0])
RuntimeError: The size of tensor a (3) must match the size of tensor b (4) at non-singleton dimension 0

Also, I tried printing the shapes of pred_boxes and gt_boxes in the function compute_intersection_and_union_area_per_region, and I got the following results:
55it [00:55, 1.18it/s]
pred_boxes.size: torch.Size([4, 29, 4])
gt_boxes.size: torch.Size([4, 29, 4])
56it [00:56, 1.19it/s]
pred_boxes.size: torch.Size([3, 29, 4])
gt_boxes.size: torch.Size([4, 29, 4])
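To illustrate where the error surfaces, here is a minimal sketch of the element-wise box intersection/union computation that the traceback points at; it is reconstructed from the traceback line shown above, not copied from the repository, and the function name and return values are assumptions:

```python
import torch

def intersection_and_union_areas(pred_boxes, gt_boxes):
    # pred_boxes, gt_boxes: [batch_size, 29 regions, 4] boxes in (x0, y0, x1, y1) format.
    # torch.maximum/torch.minimum operate element-wise, so the two batch dimensions
    # must match; a [3, 29, 4] vs. [4, 29, 4] pair raises exactly the RuntimeError
    # reported above.
    x0_max = torch.maximum(pred_boxes[..., 0], gt_boxes[..., 0])
    y0_max = torch.maximum(pred_boxes[..., 1], gt_boxes[..., 1])
    x1_min = torch.minimum(pred_boxes[..., 2], gt_boxes[..., 2])
    y1_min = torch.minimum(pred_boxes[..., 3], gt_boxes[..., 3])

    # Clamp to zero so non-overlapping boxes contribute no intersection area.
    intersection = (x1_min - x0_max).clamp(min=0) * (y1_min - y0_max).clamp(min=0)

    pred_area = (pred_boxes[..., 2] - pred_boxes[..., 0]) * (pred_boxes[..., 3] - pred_boxes[..., 1])
    gt_area = (gt_boxes[..., 2] - gt_boxes[..., 0]) * (gt_boxes[..., 3] - gt_boxes[..., 1])
    union = pred_area + gt_area - intersection
    return intersection, union

# Reproduces the reported failure:
# intersection_and_union_areas(torch.rand(3, 29, 4), torch.rand(4, 29, 4))
# -> RuntimeError: The size of tensor a (3) must match the size of tensor b (4) ...
```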
I am confused as to why the batch sizes do not match with multiple GPUs, when everything works correctly on a single GPU.
Hope to hear from you soon.

ttanida (Owner) commented Jul 29, 2023

Hi @giantke,

let me address your concerns:

  1. Runtime: When we executed test_set_evaluation.py, our run took approximately 3 days using an Nvidia A40 (48 GB). Given the magnitude of the (huge) test set – with its 32,711 test images – the average time per image thus comes to about 8 seconds. The language model component of our model, in particular, is quite bulky. This results in constraints around memory capacity, e.g. we could only train the full model with a batch size of 2. Inference is also constrained by these memory limitations (I think we also used a batch size of 4 for the test set run). I'm curious: which GPU were you using when the execution spanned 10 days?
  2. Batch Size Mismatch with Multiple GPUs: The discrepancy you observed in the batch size between using multiple GPUs and a single GPU is intriguing. I may not be an expert in multi-GPU scenarios, but I'd suggest narrowing down your investigation to the specific samples (220 - 224 in the test-2 csv file) that consistently cause the error. It's plausible that there could be synchronization issues or mismatches when consolidating results from different GPUs for those specific samples.
  3. Potential (Hacky) Solutions:
  • Single GPU Execution: Since there is no error when using a single GPU, a hacky solution might be to run evaluate_obj_detector_and_binary_classifiers_on_test_set on a single GPU and run evaluate_language_model_on_test_set on multiple GPUs (since the latter is the resource-intensive part); a rough sketch is included at the end of this comment. You would have to modify the code accordingly.
  • Partial Single GPU Execution: Another viable approach could be to just process iterate_over_test_loader(test_2_loader, num_images, is_test_2_loader=True) within evaluate_obj_detector_and_binary_classifiers_on_test_set on a single GPU, as this is where the error happens and the test-2 csv file is much smaller than the test csv file. Again, this would require you to modify the code.

These solutions don't address the main (unknown) cause of the error, but they should probably work. I hope this provides some clarity and offers a direction to move forward. Please keep me posted on your progress and any further challenges.
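As referenced above, a rough sketch of the single-GPU-for-the-detector idea, assuming the model was wrapped in torch.nn.DataParallel as described in the issue; the function signatures are taken from the traceback, while the unwrapping logic is an assumption, not code from the repository:

```python
import torch

def evaluate_model_on_test_set(model, test_loader, test_2_loader, tokenizer):
    # Unwrap the DataParallel container so that the object detector / binary
    # classifier evaluation runs on a single GPU (model.module is the underlying model).
    single_gpu_model = model.module if isinstance(model, torch.nn.DataParallel) else model

    obj_detector_scores, region_selection_scores, region_abnormal_scores = \
        evaluate_obj_detector_and_binary_classifiers_on_test_set(single_gpu_model, test_loader, test_2_loader)

    # Keep the wrapped (multi-GPU) model for the language model evaluation,
    # which is the resource-intensive part. The exact call is not shown in this
    # thread, so it is only indicated here:
    # language_model_scores = evaluate_language_model_on_test_set(model, ...)
    ...
```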

giantke (Author) commented Jul 30, 2023

Hi @ttanida,
Thank you very much for your timely and detailed reply.

  1. I was using 4× RTX 2080 Ti GPUs (11 GB each), which is quite limited. I therefore intend to try a new RTX 4090 next week, which computes much faster.
  2. I have checked the test-2 csv file around that batch and found nothing special. I agree that there must be some aggregation problem in the parallel process that needs further investigation (I have not run into that kind of problem before when doing multi-GPU inference).
  3. Your suggestions are really helpful, and I will try the partial single-GPU execution approach later with the new graphics card.
