Torch2.0 CUDA runtime error during NNCF optimization of ROIAlign MMCV kernel for MaskRCNN #2451

Open
goodsong81 opened this issue Aug 24, 2023 · 0 comments

Describe the bug

While verifying the torch version upgrade from 1.13.1 to 2.0.1, an integration test for NNCF optimization failed with a CUDA runtime error.

[Error log from CI run]

Traceback (most recent call last):
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/bin/otx", line 8, in <module>
    sys.exit(main())
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/otx/cli/tools/cli.py", line 77, in main
    results = globals()[f"otx_{name}"]()
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/otx/cli/tools/optimize.py", line 146, in main
    predicted_validation_dataset = task.infer(
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/otx/algorithms/detection/task.py", line 300, in infer
    prediction_results, _ = self._infer_model(dataset, inference_parameters)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/otx/algorithms/detection/adapters/mmdet/task.py", line 429, in _infer_model
    eval_predictions = single_gpu_test(model, dataloader)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/mmdet/apis/test.py", line 29, in single_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/nncf/torch/dynamic_graph/wrappers.py", line 131, in wrapped
    return module_call(self, *args, **kwargs)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/mmcv/parallel/data_parallel.py", line 51, in forward
    return super().forward(*inputs, **kwargs)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 169, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/nncf/torch/nncf_network.py", line 886, in __call__
    return ORIGINAL_CALL(self, *args, **kwargs)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/nncf/torch/nncf_network.py", line 906, in forward
    retval = wrap_module_call(self.nncf._original_unbound_forward)(self, *args, **kwargs)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/nncf/torch/dynamic_graph/wrappers.py", line 151, in wrapped
    retval = module_call(self, *args, **kwargs)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/mmcv/runner/fp16_utils.py", line 119, in new_func
    return old_func(*args, **kwargs)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/mmdet/models/detectors/base.py", line 174, in forward
    return self.forward_test(img, img_metas, **kwargs)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/mmdet/models/detectors/base.py", line 147, in forward_test
    return self.simple_test(imgs[0], img_metas[0], **kwargs)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/otx/algorithms/detection/adapters/mmdet/models/detectors/custom_maskrcnn_tile_optimized.py", line 198, in simple_test
    x = self.extract_feat(img)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/mmdet/models/detectors/two_stage.py", line 67, in extract_feat
    x = self.backbone(img)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/nncf/torch/dynamic_graph/wrappers.py", line 151, in wrapped
    retval = module_call(self, *args, **kwargs)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1547, in _call_impl
    hook_result = hook(self, args, result)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/otx/algorithms/common/adapters/mmcv/hooks/recording_forward_hook.py", line 75, in _recording_forward
    tensors = self.func(output)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/otx/algorithms/detection/adapters/mmdet/hooks/det_class_probability_map_hook.py", line 177, in func
    saliency_maps = self._get_saliency_maps_from_mask_predictions(feature_map, det_bboxes, det_labels)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/otx/algorithms/detection/adapters/mmdet/hooks/det_class_probability_map_hook.py", line 206, in _get_saliency_maps_from_mask_predictions
    mask_results = self._module.roi_head._mask_forward(x, mask_rois)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/mmdet/models/roi_heads/standard_roi_head.py", line 186, in _mask_forward
    mask_feats = self.mask_roi_extractor(
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/nncf/torch/dynamic_graph/wrappers.py", line 151, in wrapped
    retval = module_call(self, *args, **kwargs)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/mmcv/runner/fp16_utils.py", line 208, in new_func
    return old_func(*args, **kwargs)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/mmdet/models/roi_heads/roi_extractors/single_level_roi_extractor.py", line 93, in forward
    roi_feats_t = self.roi_layers[i](feats[i], rois_i)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/nncf/torch/dynamic_graph/wrappers.py", line 151, in wrapped
    retval = module_call(self, *args, **kwargs)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/mmcv/ops/roi_align.py", line 215, in forward
    return roi_align(input, rois, self.output_size, self.spatial_scale,
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/validation/actions-runner/_work/training_extensions/training_extensions/.tox/tests-iseg-py310/lib/python3.10/site-packages/mmcv/ops/roi_align.py", line 95, in forward
    ext_module.roi_align_forward(
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[Other output interleaved with the traceback in the CI log]

NNCF relies on custom-wrapping the `forward` call in order to function properly.
Arbitrary adjustments to the forward function on an NNCFNetwork object have undefined behaviour.
If you need to replace the underlying forward function of the original model so that NNCF should be using that instead of the original forward function that NNCF saved during the compressed model creation, you can do this by calling:
model.nncf.set_original_unbound_forward(fn)
if `fn` has an unbound 0-th `self` argument, or
with model.nncf.temporary_bound_original_forward(fn): ...
if `fn` already had 0-th `self` argument bound or never had it in the first place.
2023-08-23 13:14:23,669 | INFO : ----------------- CustomMaskRCNN.load_state_dict_pre_hook() called w/ prefix: 
2023-08-23 13:14:23,674 | INFO : ['rectangle', 'ellipse', 'triangle'] -> ['rectangle', 'ellipse', 'triangle'] ([0, 1, 2])
INFO:nncf:Loaded 1186/1186 parameters
2023-08-23 13:14:23,985 | INFO : ----------------- CustomMaskRCNN.load_state_dict_pre_hook() called w/ prefix: 
2023-08-23 13:14:23,990 | INFO : ['rectangle', 'ellipse', 'triangle'] -> ['rectangle', 'ellipse', 'triangle'] ([0, 1, 2])
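
The error message above notes that CUDA errors can be reported asynchronously and suggests passing CUDA_LAUNCH_BLOCKING=1 so that the stack trace points at the kernel that actually failed. A minimal sketch of one way to do that is shown here. The helper names `blocking_cuda_env` and `run_with_blocking_cuda` are hypothetical and are not part of OTX, NNCF, or torch; the sketch only assumes that the variable must be present in the environment of the process that initializes CUDA.

```python
import os
import subprocess


def blocking_cuda_env(base_env=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 set.

    The variable has to be in place before the target process initializes
    CUDA, so it is set on the child process environment, not at runtime.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking_cuda(cmd):
    """Run `cmd` (a list of arguments) with synchronous CUDA kernel launches.

    Returns the exit code of the child process.
    """
    completed = subprocess.run(cmd, env=blocking_cuda_env(), check=False)
    return completed.returncode
```

The failing command would be passed as an argument list, for example `run_with_blocking_cuda(["python", "my_repro.py"])`, where `my_repro.py` is a placeholder. Note that tox filters environment variables by default, so when the test is launched through tox the variable may additionally need to be allowed through in the tox configuration.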

Steps to Reproduce

git fetch -p && git checkout 1e208c6e090e7695057b8fb9b7424ec6fdce1a25
# poc/torch2.0 might be skipping the failing tests
tox -e tests-all-py310 -- tests/integration/cli/instance_segmentation/test_instance_segmentation.py -k Custom_Counting_Instance_Segmentation_MaskRCNN_ResNet50

Environment:

  • OS: Ubuntu 20.04
  • Framework version: Torch2.0
  • Python version: 3.10
  • OpenVINO version: 2023.0
  • CUDA/cuDNN version: 11.7
  • GPU model and memory: 3090 / 24G
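
For anyone trying to reproduce this on another machine, the environment details listed above can be gathered with a short script. This is only a sketch: `collect_environment` is a hypothetical helper, it reports whatever torch build is importable in the current interpreter, and it does not cover the OpenVINO version.

```python
import platform


def collect_environment():
    """Collect OS, Python, torch, CUDA/cuDNN, and GPU details as a dict."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
    }
    try:
        import torch
    except ImportError:
        # torch is optional here so the script still reports OS and Python.
        info["torch"] = "not installed"
        return info

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None for CPU-only builds
    info["cudnn"] = torch.backends.cudnn.version()  # None if unavailable
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        info["gpu"] = f"{props.name} / {props.total_memory / 2**30:.0f} GiB"
    return info


for key, value in collect_environment().items():
    print(f"{key}: {value}")
```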
goodsong81 changed the title from "Torch2.0 CUDA runtime error during NNCF optimization of MaskRCNN at ROIAlign MMCV kernel" to "Torch2.0 CUDA runtime error during NNCF optimization of ROIAlign MMCV kernel for MaskRCNN" on Aug 24, 2023
goodsong81 mentioned this issue on Sep 19, 2023