How to manage empty items in a batch? #5251
Comments
Hi @bveldhoen, Thank you for reaching out. DALI operators cannot necessarily handle empty samples: some do, some don't.
Hi @JanuszL, Thanks for your response.

In our scenario, any number of items in the batch could be 'empty' (or not). For instance, in a batch of 4, only 1 item could be empty, with the rest containing valid items. This cascades through the subsequent operators: each operator that receives an empty (or invalid) item should produce an empty (or invalid) item at the same index in the resulting batch (or resulting batches, if the operator has more than one output). Using conditional execution will stop the execution of the entire batch, which is not the goal in our scenario.

I think this could be implemented in a straightforward way by allowing batch items to be None (in Python), with a corresponding implementation in C++ (for instance, an is_empty flag on Tensor, or something similar?). This would require each operator to check each item for emptiness during processing, which might require a lot of changes.

For now, I'll continue using signal values (arrays filled with -1, or with a shape whose first dimension is set to 0). Thanks!
Despite the convenient Python syntax (if/else), conditional execution works per sample, so each sample can take a separate execution path. Under the hood, split/merge operators are added that partition the samples according to the condition. It doesn't stop the execution; it just redirects samples in different directions.
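To make the split/merge idea concrete, here is a small plain-NumPy sketch of the behaviour described above. It is only an illustration of the concept, not DALI's actual implementation: `split_merge`, `predicate`, `true_branch` and `false_branch` are hypothetical names, and the "first dimension is 0 means empty" convention is borrowed from the question.

```python
import numpy as np


def split_merge(batch, predicate, true_branch, false_branch):
    """Route each sample through one of two branches, then merge in order."""
    # Split: decide per sample which branch it belongs to.
    mask = [bool(predicate(sample)) for sample in batch]
    true_idx = [i for i, m in enumerate(mask) if m]
    false_idx = [i for i, m in enumerate(mask) if not m]

    # Each branch only sees its own samples (anywhere from 0 to len(batch)).
    true_out = true_branch([batch[i] for i in true_idx])
    false_out = false_branch([batch[i] for i in false_idx])

    # Merge: put every result back at its original position in the batch.
    merged = [None] * len(batch)
    for i, out in zip(true_idx, true_out):
        merged[i] = out
    for i, out in zip(false_idx, false_out):
        merged[i] = out
    return merged


# A batch of 4 where items 1 and 3 are "empty" (first dimension is 0).
batch = [
    np.ones((2, 3)),
    np.empty((0, 3)),
    np.full((1, 3), 2.0),
    np.empty((0, 3)),
]

result = split_merge(
    batch,
    predicate=lambda s: s.shape[0] > 0,
    true_branch=lambda samples: [s * 10 for s in samples],  # "real" processing
    false_branch=lambda samples: list(samples),  # empty items pass through
)
```

The empty items stay at their original indices, and the processing branch never sees them, which is the property the question asks for.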
I see! I didn't know that the conditional execution was per sample, thanks for the clarification. I'll give it a try (next week). Will this work with a fn.python_function with batch_processing=True? (or is it required to do the check within the called python function in this case?) Would this work with an additional output, containing a batch with Tensors, which contain True/False?
As far as I understand, it should. However, each time the Python function will get only the samples in a given condition branch, which can be anywhere between 0 and the batch size.
I think the variable needs to be defined in both branches.
Describe the question.
Thanks in advance for your help.
I'm running into an issue in a pipeline with ~11 operators. During processing, some processing steps may become irrelevant for certain items in the batch. For these empty batch items, processing should be skipped for any subsequent operators.
Currently, it seems to be required to implement workarounds for this by setting the returned Tensors to contain signal values. For some operators, I can get away with returning a Tensor (i.e. a torch Tensor from a fn.torch_python_function, or a cupy.ndarray from a fn.python_function) whose shape has its first dimension set to 0, for instance (0, 640, 640, 3). But this does not always work (some operators raise exceptions), and in some cases I have had to return bogus arrays containing -1 values instead. In custom operators, it is then required to test for these signal values, skip processing, and return empty values for these empty batch items.
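For readers following along, the signal-value workaround described above can be sketched roughly as follows. This is not the snippet from the original post: `is_empty` and `process_batch` are hypothetical names, the scaling step is just a stand-in for real processing, and the sketch assumes the function receives the batch as a list of NumPy arrays, one per sample.

```python
import numpy as np


def is_empty(sample):
    # Assumed convention: a first dimension of 0 marks an empty item.
    return sample.shape[0] == 0


def process_batch(samples):
    """Process a list of per-sample arrays, skipping empty items."""
    outputs = []
    for sample in samples:
        if is_empty(sample):
            # Keep the empty item at the same index so later steps
            # can detect it and skip it too.
            outputs.append(sample)
        else:
            # Stand-in for the real work: scale uint8 pixels to [0, 1].
            outputs.append(sample.astype(np.float32) / 255.0)
    return outputs


batch = [
    np.full((2, 4, 4, 3), 255, dtype=np.uint8),
    np.empty((0, 4, 4, 3), dtype=np.uint8),  # empty item
]
out = process_batch(batch)
```

Every downstream custom operator has to repeat the same `is_empty` check, which is the overhead this question is trying to avoid.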
Below a code snippet to (hopefully) clarify:
Note that this implementation uses batch_processing=True. Would this be different/improved if using batch_processing=False? (i.e. does DALI then check for empty/None batch items?)
In general, what is the correct approach to deal with empty batch items?