
[Bug] TensorFlow - CUDA: multiprocessing does not work as expected - Dataloader and inference pipeline #1440

Open
Lubhawan opened this issue Jan 25, 2024 · 4 comments
Labels
framework: tensorflow (Related to TensorFlow backend), module: transforms (Related to doctr.transforms), type: bug (Something isn't working)
Milestone
1.0.0
Comments

@Lubhawan

Bug description

Expected the model to run successfully, but it throws a "JIT compilation failed" error when running on GPU.

Code snippet to reproduce the bug

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(det_arch="linknet_resnet18", reco_arch="crnn_vgg16_bn", pretrained=True)
img_path = "/home/lubhawan/Downloads/iloveimg-converted/Hospital-Bill-4.jpg"  # Specify your image path here
img = DocumentFile.from_images(img_path)
result = model(img)

Error traceback

UnknownError Traceback (most recent call last)
Cell In[5], line 3
1 img_path = "/home/lubhawan/Downloads/iloveimg-converted/Hospital-Bill-4.jpg" #Specify your image path here
2 img = DocumentFile.from_images(img_path)
----> 3 result = model(img)
4 output = result.export()

File ~/.local/lib/python3.11/site-packages/doctr/models/predictor/tensorflow.py:89, in OCRPredictor.call(self, pages, **kwargs)
86 pages = [rotate_image(page, -angle, expand=True) for page, angle in zip(pages, origin_page_orientations)]
88 # Localize text elements
---> 89 loc_preds_dict = self.det_predictor(pages, **kwargs)
90 assert all(
91 len(loc_pred) == 1 for loc_pred in loc_preds_dict
92 ), "Detection Model in ocr_predictor should output only one class"
94 loc_preds: List[np.ndarray] = [list(loc_pred.values())[0] for loc_pred in loc_preds_dict]

File ~/.local/lib/python3.11/site-packages/doctr/models/detection/predictor/tensorflow.py:45, in DetectionPredictor.call(self, pages, **kwargs)
42 if any(page.ndim != 3 for page in pages):
43 raise ValueError("incorrect input shape: all pages are expected to be multi-channel 2D images.")
---> 45 processed_batches = self.pre_processor(pages)
46 predicted_batches = [
47 self.model(batch, return_preds=True, training=False, **kwargs)["preds"] for batch in processed_batches
48 ]
49 return [pred for batch in predicted_batches for pred in batch]

File ~/.local/lib/python3.11/site-packages/doctr/models/preprocessor/tensorflow.py:111, in PreProcessor.call(self, x)
107 batches = [x]
109 elif isinstance(x, list) and all(isinstance(sample, (np.ndarray, tf.Tensor)) for sample in x):
110 # Sample transform (to tensor, resize)
--> 111 samples = list(multithread_exec(self.sample_transforms, x))
112 # Batching
113 batches = self.batch_inputs(samples)

File ~/.local/lib/python3.11/site-packages/doctr/utils/multithreading.py:47, in multithread_exec(func, seq, threads)
42 # Multi-threading
43 else:
44 with ThreadPool(threads) as tp:
45 # ThreadPool's map function returns a list, but seq could be of a different type
46 # That's why wrapping result in map to return iterator
---> 47 results = map(lambda x: x, tp.map(func, seq))
48 return results

File ~/anconda3/lib/python3.11/multiprocessing/pool.py:367, in Pool.map(self, func, iterable, chunksize)
362 def map(self, func, iterable, chunksize=None):
363 '''
364 Apply func to each element in iterable, collecting the results
365 in a list that is returned.
366 '''
--> 367 return self._map_async(func, iterable, mapstar, chunksize).get()

File ~/anconda3/lib/python3.11/multiprocessing/pool.py:774, in ApplyResult.get(self, timeout)
772 return self._value
773 else:
--> 774 raise self._value

File ~/anconda3/lib/python3.11/multiprocessing/pool.py:125, in worker(inqueue, outqueue, initializer, initargs, maxtasks, wrap_exception)
123 job, i, func, args, kwds = task
124 try:
--> 125 result = (True, func(*args, **kwds))
126 except Exception as e:
127 if wrap_exception and func is not _helper_reraises_exception:

File ~/anconda3/lib/python3.11/multiprocessing/pool.py:48, in mapstar(args)
47 def mapstar(args):
---> 48 return list(map(*args))

File ~/.local/lib/python3.11/site-packages/doctr/models/preprocessor/tensorflow.py:76, in PreProcessor.sample_transforms(self, x)
74 x = tf.image.convert_image_dtype(x, dtype=tf.float32)
75 # Resizing
---> 76 x = self.resize(x)
78 return x

File ~/.local/lib/python3.11/site-packages/doctr/transforms/modules/tensorflow.py:107, in Resize.call(self, img, target)
100 def call(
101 self,
102 img: tf.Tensor,
103 target: Optional[np.ndarray] = None,
104 ) -> Union[tf.Tensor, Tuple[tf.Tensor, np.ndarray]]:
105 input_dtype = img.dtype
--> 107 img = tf.image.resize(img, self.wanted_size, self.method, self.preserve_aspect_ratio)
108 # It will produce an un-padded resized image, with a side shorter than wanted if we preserve aspect ratio
109 raw_shape = img.shape[:2]

File ~/.local/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback..error_handler(*args, **kwargs)
151 except Exception as e:
152 filtered_tb = _process_traceback_frames(e.traceback)
--> 153 raise e.with_traceback(filtered_tb) from None
154 finally:
155 del filtered_tb

File ~/.local/lib/python3.11/site-packages/tensorflow/python/framework/ops.py:5883, in raise_from_not_ok_status(e, name)
5881 def raise_from_not_ok_status(e, name) -> NoReturn:
5882 e.message += (" name: " + str(name if name is not None else ""))
-> 5883 raise core._status_to_exception(e) from None

UnknownError: {{function_node _wrapped__Round_device/job:localhost/replica:0/task:0/device:GPU:0}} JIT compilation failed. [Op:Round] name:

Environment

DocTR version: v0.7.0
TensorFlow version: 2.15.0
PyTorch version: 2.1.2+cu121 (torchvision 0.16.2+cu121)
OpenCV version: 4.9.0
OS: Ubuntu 22.04.3 LTS
Python version: 3.11.5
Is CUDA available (TensorFlow): Yes
Is CUDA available (PyTorch): Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4070
Nvidia driver version: 535.154.05
cuDNN version: Could not collect

Deep Learning backend

is_tf_available: True
is_torch_available: True

@Lubhawan added the type: bug (Something isn't working) label on Jan 25, 2024
@felixdittrich92
Contributor

Hi @Lubhawan 👋

Thanks for reporting this.
We have already faced this issue; it comes from the transformations (only on CUDA, on CPU everything works as expected).

I will update your report a bit:

TensorFlow (only on CUDA)
Affected transformations: Resize, Shadow, Blur
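
For reference, a minimal sketch of the failing pattern, reconstructed from the traceback (not taken from doctr's source; shapes and sizes are arbitrary): tf.image.resize executed from a multiprocessing.pool.ThreadPool worker on a CUDA device, the way doctr's preprocessor does it.

from multiprocessing.pool import ThreadPool

import numpy as np
import tensorflow as tf

def resize_sample(img: np.ndarray) -> tf.Tensor:
    # Mirror doctr's preprocessing: convert to float32, then resize with
    # preserve_aspect_ratio=True (the code path that reaches the Round op in the traceback).
    x = tf.image.convert_image_dtype(tf.convert_to_tensor(img), dtype=tf.float32)
    return tf.image.resize(x, (1024, 1024), "bilinear", preserve_aspect_ratio=True)

pages = [np.random.randint(0, 255, (768, 512, 3), dtype=np.uint8) for _ in range(4)]
with ThreadPool(2) as tp:
    # Runs fine on CPU; on the reported CUDA setup the workers raise
    # UnknownError: JIT compilation failed [Op:Round]
    out = tp.map(resize_sample, pages)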

@felixdittrich92
Contributor

Could you please try it again with multiprocessing disabled?

DOCTR_MULTIPROCESSING_DISABLE=TRUE
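
A sketch of how to apply this (the image path and script name are placeholders), either from the shell:

DOCTR_MULTIPROCESSING_DISABLE=TRUE python your_script.py

or from Python, before the predictor is run:

import os

os.environ["DOCTR_MULTIPROCESSING_DISABLE"] = "TRUE"  # set before doctr builds its thread pool

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(det_arch="linknet_resnet18", reco_arch="crnn_vgg16_bn", pretrained=True)
result = model(DocumentFile.from_images("/path/to/your/image.jpg"))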

@felixdittrich92 changed the title from "JIT compilation failed while running on gpu" to "[Bug] TensorFlow: Resize, Blur, Shadow transformations raises exception only on CUDA" on Jan 26, 2024
@felixdittrich92 added the module: transforms and framework: tensorflow labels on Jan 26, 2024
@felixdittrich92 added this to the 0.9.0 milestone on Jan 26, 2024
@Lubhawan
Author

Yeah, it is working fine on CPU.

@felixdittrich92 modified the milestone from 0.9.0 to 1.0.0 on Feb 9, 2024
@felixdittrich92
Contributor

Related to the multiprocessing used in the dataloader and inference pipeline.
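
For context, a hedged sketch (not doctr's actual source) of what that flag changes in doctr.utils.multithreading.multithread_exec: with the flag set, samples are mapped sequentially on the calling thread instead of through the ThreadPool branch shown in the traceback, so every TensorFlow op stays in the main thread's CUDA context.

import os
from multiprocessing.pool import ThreadPool
from typing import Any, Callable, Iterable, Iterator

def multithread_exec(func: Callable[[Any], Any], seq: Iterable[Any], threads: int = 4) -> Iterator[Any]:
    if os.environ.get("DOCTR_MULTIPROCESSING_DISABLE", "").upper() == "TRUE":
        # Sequential fallback: no worker threads touch the GPU.
        return map(func, seq)
    with ThreadPool(threads) as tp:
        # ThreadPool.map returns a list; wrap it in map to return an iterator (as in the traceback).
        return map(lambda x: x, tp.map(func, seq))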

@felixdittrich92 changed the title from "[Bug] TensorFlow: Resize, Blur, Shadow transformations raises exception only on CUDA" to "[Bug] TensorFlow - CUDA: multiprocessing does not work as expected - Dataloader and inference pipeline" on May 22, 2024