Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] Allow configuration of MAX_IMAGE_PIXELS in ImageDatasource #45415

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

andrewsykim
Copy link
Contributor

Why are these changes needed?

PIL.Image sets a default limit to image sizes to prevent decompression bomb DOS attacks. Example stack trace:

  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/image_datasource.py", line 81, in _read_stream
    image = Image.open(io.BytesIO(data))
  File "/home/ray/anaconda3/lib/python3.8/site-packages/PIL/Image.py", line 3133, in open
    im = _open_core(fp, filename, prefix, formats)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/PIL/Image.py", line 3120, in _open_core
    _decompression_bomb_check(im.size)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/PIL/Image.py", line 3029, in _decompression_bomb_check
    raise DecompressionBombError(
PIL.Image.DecompressionBombError: Image size (222926661 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack.

However, the limit should be configurable when processing images with a trusted dataset and users should be able to configure the limit when calling ray.data.read_images.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@can-anyscale
Copy link
Collaborator

can-anyscale commented May 17, 2024

i rebased the pr to the latest of master to avoid some bugs in ci, my apologies if this interrupt your work in any ways, thankks

@andrewsykim andrewsykim changed the title [WIP][Data] Allow configuration of MAX_IMAGE_PIXELS in ImageDatasource [Data] Allow configuration of MAX_IMAGE_PIXELS in ImageDatasource May 17, 2024
@andrewsykim andrewsykim force-pushed the max-image-pixels branch 2 times, most recently from b2aeee2 to 75a2c81 Compare May 17, 2024 18:30
@andrewsykim andrewsykim marked this pull request as ready for review May 17, 2024 18:50
@andrewsykim andrewsykim force-pushed the max-image-pixels branch 3 times, most recently from a44f713 to e881c58 Compare May 17, 2024 20:29
@andrewsykim andrewsykim marked this pull request as draft May 17, 2024 21:35
@andrewsykim andrewsykim force-pushed the max-image-pixels branch 3 times, most recently from 990cac5 to 06c7488 Compare May 18, 2024 05:44
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
@andrewsykim andrewsykim marked this pull request as ready for review May 19, 2024 02:00
@andrewsykim
Copy link
Contributor Author

I tried to add a unit test but I don't think they exercise reading actual images. I manually ran this script to test the changes:

import ray

DATA_URI = "gs://anonymous@ray-images/images"

def main():
    ray.init()
    dataset = ray.data.read_images(DATA_URI, include_paths=True, max_image_pixels=4)

    dataset_iter = dataset.iter_batches(batch_size=None)
    for _ in dataset_iter:
        pass

if __name__ == "__main__":
    main()

Exception raised:

  File "/home/andrewsy/go/src/github.com/ray-project/ray/myenv/lib/python3.11/site-packages/PIL/Image.py", line 3120, in _open_core
    _decompression_bomb_check(im.size)
  File "/home/andrewsy/go/src/github.com/ray-project/ray/myenv/lib/python3.11/site-packages/PIL/Image.py", line 3029, in _decompression_bomb_check
    raise DecompressionBombError(
PIL.Image.DecompressionBombError: Image size (47160 pixels) exceeds limit of 8 pixels, could be decompression bomb DOS attack.

@kevin85421
Copy link
Member

cc @c21 would you mind reviewing this PR? Thanks!

Copy link
Contributor

@raulchen raulchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also add a test?

@@ -75,6 +77,9 @@ def _read_stream(
) -> Iterator[Block]:
from PIL import Image, UnidentifiedImageError

if self.max_image_pixels is not None:
Image.MAX_IMAGE_PIXELS = self.max_image_pixels
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this updates the the global variable, and will also affect the next read_images call.
Is there a way to avoid this?

@@ -791,6 +791,7 @@ def read_images(
file_extensions: Optional[List[str]] = ImageDatasource._FILE_EXTENSIONS,
concurrency: Optional[int] = None,
override_num_blocks: Optional[int] = None,
max_image_pixels: Optional[int] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update this arg to something like pil_image_args? so we don't need to keep adding new parameters in the future.

@andrewsykim
Copy link
Contributor Author

can you also add a test?

I tried to add a unit test but I couldn't get it passing because it didn't seem like the test actually reads images. Can you point me to an example unit test that is actually read images?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants