
DataParallel in test_set_evaluation.py #11

Open
giantke opened this issue Jul 29, 2023 · 2 comments
giantke commented Jul 29, 2023

Hi Tim,
Congratulations on the awesome work.
I followed the workflow to run test_set_evaluation.py for inference, which cost me an enormous amount of time (more than ten days). I therefore tried to run it on multiple GPUs by adding model = torch.nn.DataParallel(model, device_ids=[0, 1, 2, 3]) in the function get_model(), and it did seem to work: when I checked GPU utilization, all of the GPUs were busy.
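For reference, a minimal sketch of this kind of change, assuming get_model() builds and returns an nn.Module; build_full_model() and the device argument here are placeholders, not code from the repository:

```python
import torch

def get_model(device):
    # Placeholder for the model construction code that already exists in get_model().
    model = build_full_model()
    model.to(device)

    # DataParallel scatters each input batch along dim 0 across device_ids
    # and gathers the outputs back on the first listed device.
    if torch.cuda.device_count() > 1:
        model = torch.nn.DataParallel(model, device_ids=[0, 1, 2, 3])

    model.eval()
    return model
```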
However, I ran into a problem at the 56th iteration with batch_size=4, and also at the 14th iteration with batch_size=16. It cannot be a coincidence, since 56×4 = 14×16 = 224, so both point to the 224th sample. The error was reported as follows:
56it [00:56, 1.02s/it]
Traceback (most recent call last):
  File "/public/home/zhangke/rgrg-main/src/full_model/test_set_evaluation.py", line 906, in <module>
    main()
  File "/public/home/zhangke/rgrg-main/src/full_model/test_set_evaluation.py", line 902, in main
    evaluate_model_on_test_set(model, test_loader, test_2_loader, tokenizer)
  File "/public/home/zhangke/rgrg-main/src/full_model/test_set_evaluation.py", line 740, in evaluate_model_on_test_set
    obj_detector_scores, region_selection_scores, region_abnormal_scores = evaluate_obj_detector_and_binary_classifiers_on_test_set(model, test_loader, test_2_loader)
  File "/public/home/zhangke/rgrg-main/src/full_model/test_set_evaluation.py", line 714, in evaluate_obj_detector_and_binary_classifiers_on_test_set
    num_images = iterate_over_test_loader(test_2_loader, num_images, is_test_2_loader=True)
  File "/public/home/zhangke/rgrg-main/src/full_model/test_set_evaluation.py", line 631, in iterate_over_test_loader
    update_object_detector_metrics_test_loader_2(obj_detector_scores, detections, image_targets, class_detected)
  File "/public/home/zhangke/rgrg-main/src/full_model/test_set_evaluation.py", line 555, in update_object_detector_metrics_test_loader_2
    intersection_area_per_region_batch, union_area_per_region_batch = compute_intersection_and_union_area_per_region(detections, image_targets, class_detected)
  File "/public/home/zhangke/rgrg-main/src/full_model/test_set_evaluation.py", line 518, in compute_intersection_and_union_area_per_region
    x0_max = torch.maximum(pred_boxes[..., 0], gt_boxes[..., 0])
RuntimeError: The size of tensor a (3) must match the size of tensor b (4) at non-singleton dimension 0

Also, I tried printing the shapes of pred_boxes and gt_boxes in the function compute_intersection_and_union_area_per_region, and I got the following results:
55it [00:55, 1.18it/s]
pred_boxes.size: torch.Size([4, 29, 4])
gt_boxes.size: torch.Size([4, 29, 4])
56it [00:56, 1.19it/s]
pred_boxes.size: torch.Size([3, 29, 4])
gt_boxes.size: torch.Size([4, 29, 4])
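To illustrate where the error surfaces, here is a minimal sketch of the element-wise box intersection/union computation that the traceback points at; it is reconstructed from the traceback line shown above, not copied from the repository, and the function name and return values are assumptions:

```python
import torch

def intersection_and_union_areas(pred_boxes, gt_boxes):
    # pred_boxes, gt_boxes: [batch_size, 29 regions, 4] boxes in (x0, y0, x1, y1) format.
    # torch.maximum/torch.minimum operate element-wise, so the two batch dimensions
    # must match; a [3, 29, 4] vs. [4, 29, 4] pair raises exactly the RuntimeError
    # reported above.
    x0_max = torch.maximum(pred_boxes[..., 0], gt_boxes[..., 0])
    y0_max = torch.maximum(pred_boxes[..., 1], gt_boxes[..., 1])
    x1_min = torch.minimum(pred_boxes[..., 2], gt_boxes[..., 2])
    y1_min = torch.minimum(pred_boxes[..., 3], gt_boxes[..., 3])

    # Clamp to zero so non-overlapping boxes contribute no intersection area.
    intersection = (x1_min - x0_max).clamp(min=0) * (y1_min - y0_max).clamp(min=0)

    pred_area = (pred_boxes[..., 2] - pred_boxes[..., 0]) * (pred_boxes[..., 3] - pred_boxes[..., 1])
    gt_area = (gt_boxes[..., 2] - gt_boxes[..., 0]) * (gt_boxes[..., 3] - gt_boxes[..., 1])
    union = pred_area + gt_area - intersection
    return intersection, union

# Reproduces the reported failure:
# intersection_and_union_areas(torch.rand(3, 29, 4), torch.rand(4, 29, 4))
# -> RuntimeError: The size of tensor a (3) must match the size of tensor b (4) ...
```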
I am confused as to why the batch sizes do not match with multiple GPUs, when everything works correctly on a single GPU.
Hope to hear from you soon.

ttanida (Owner) commented Jul 29, 2023

Hi @giantke,

let me address your concerns:

  1. Runtime: When we executed test_set_evaluation.py, our run took approximately 3 days using an Nvidia A40 (48 GB). Given the magnitude of the (huge) test set – with its 32,711 test images – the average time per image thus comes to about 8 seconds. The language model component of our model, in particular, is quite bulky. This results in constraints around memory capacity, e.g. we could only train the full model with a batch size of 2. Inference is also constrained by these memory limitations (I think we also used a batch size of 4 for the test set run). I'm curious: which GPU were you using when the execution spanned 10 days?
  2. Batch Size Mismatch with Multiple GPUs: The discrepancy you observed in the batch size between using multiple GPUs and a single GPU is intriguing. I may not be an expert in multi-GPU scenarios, but I'd suggest narrowing down your investigation to the specific samples (220 - 224 in the test-2 csv file) that consistently cause the error. It's plausible that there could be synchronization issues or mismatches when consolidating results from different GPUs for those specific samples.
  3. Potential (Hacky) Solutions:
  • Single GPU Execution: Since there is no error when using a single GPU, a hacky solution might be to run evaluate_obj_detector_and_binary_classifiers_on_test_set on a single GPU and run evaluate_language_model_on_test_set on multiple GPUs (since the latter is the resource-intensive part); a rough sketch is included at the end of this comment. You would have to modify the code accordingly.
  • Partial Single GPU Execution: Another viable approach could be to just process iterate_over_test_loader(test_2_loader, num_images, is_test_2_loader=True) within evaluate_obj_detector_and_binary_classifiers_on_test_set on a single GPU, as this is where the error happens and the test-2 csv file is much smaller than the test csv file. Again, this would require you to modify the code.

These solutions don't address the main (unknown) cause of the error, but they should probably work. I hope this provides some clarity and offers a direction to move forward. Please keep me posted on your progress and any further challenges.
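As referenced above, a rough sketch of the single-GPU-for-the-detector idea, assuming the model was wrapped in torch.nn.DataParallel as described in the issue; the function signatures are taken from the traceback, while the unwrapping logic is an assumption, not code from the repository:

```python
import torch

def evaluate_model_on_test_set(model, test_loader, test_2_loader, tokenizer):
    # Unwrap the DataParallel container so that the object detector / binary
    # classifier evaluation runs on a single GPU (model.module is the underlying model).
    single_gpu_model = model.module if isinstance(model, torch.nn.DataParallel) else model

    obj_detector_scores, region_selection_scores, region_abnormal_scores = \
        evaluate_obj_detector_and_binary_classifiers_on_test_set(single_gpu_model, test_loader, test_2_loader)

    # Keep the wrapped (multi-GPU) model for the language model evaluation,
    # which is the resource-intensive part. The exact call is not shown in this
    # thread, so it is only indicated here:
    # language_model_scores = evaluate_language_model_on_test_set(model, ...)
    ...
```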

giantke (Author) commented Jul 30, 2023

Hi @ttanida,
Thank you very much for your timely and detailed reply.

  1. I was using 4× RTX 2080 Ti GPUs (11 GB each), which is quite limited. I therefore intend to try a new RTX 4090 next week, which computes much faster.
  2. I have checked the test-2 csv file around that batch and found nothing special. I agree that there must be some aggregation problem in the parallel process that needs further investigation (I have not run into that kind of problem before when doing multi-GPU inference).
  3. Your suggestions are really helpful, and I will try the partial single-GPU execution approach later with the new graphics card.
