[Data] num_rows_per_file doesn't work with small values #45393

Open
bveeramani opened this issue May 16, 2024 · 0 comments
Labels
bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
triage: Needs triage (eg: priority, bug/not-bug, and owning component)

What happened + What you expected to happen

My dataset contains 28k rows. I tried writing it to Parquet with num_rows_per_file=700. I expected several Parquet files, each with 700 rows, but instead got a single file containing all the rows.
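The expectation can be made concrete with a small sketch. It assumes that num_rows_per_file acts as an exact per-file row count, with any remainder going into one final, smaller file; that reading is an assumption, not confirmed Ray Data behavior. The helper expected_file_sizes is hypothetical and not part of Ray Data, and 28,000 stands in for the approximate "28k" figure.

```python
def expected_file_sizes(total_rows: int, rows_per_file: int) -> list[int]:
    """Row counts per output file under the assumed semantics (hypothetical helper)."""
    if total_rows < 0 or rows_per_file <= 0:
        raise ValueError("total_rows must be >= 0 and rows_per_file must be > 0")
    # Split the rows into as many full files as possible.
    full_files, remainder = divmod(total_rows, rows_per_file)
    sizes = [rows_per_file] * full_files
    # Any leftover rows go into one final, smaller file.
    if remainder:
        sizes.append(remainder)
    return sizes


# 28_000 is an assumed exact value for the "28k" rows mentioned above.
sizes = expected_file_sizes(28_000, 700)
print(len(sizes), "files,", sizes[0], "rows each")  # prints: 40 files, 700 rows each
```

Under that assumption, the write would produce 40 files rather than one.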

Versions / Dependencies

4d37e55

Reproduction script

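import os

import ray

# Write 100 rows from a single block, asking for 10 rows per file.
ray.data.range(100, override_num_blocks=1).write_parquet(
    "/tmp/results", num_rows_per_file=10
)

# List the output files; ten files of 10 rows each are expected.
print(os.listdir("/tmp/results"))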
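A bare directory listing does not show how far the output is from the expectation. The sketch below counts the Parquet files in the output directory and compares that count with an expected value supplied by the caller. count_parquet_files and check_file_count are hypothetical helpers, not Ray Data APIs, and filtering on the ".parquet" suffix is an assumption about how the output files are named.

```python
import os


def count_parquet_files(directory: str) -> int:
    """Count entries in `directory` whose names end in .parquet (assumed naming)."""
    return sum(1 for name in os.listdir(directory) if name.endswith(".parquet"))


def check_file_count(directory: str, expected: int) -> tuple[bool, str]:
    """Compare the number of Parquet files found with the expected number."""
    actual = count_parquet_files(directory)
    message = f"found {actual} Parquet file(s), expected {expected}"
    return actual == expected, message


# Only inspect the directory if the reproduction script has already been run.
if os.path.isdir("/tmp/results"):
    ok, message = check_file_count("/tmp/results", expected=10)
    print(message)
```

The expected value of 10 comes from 100 rows at 10 rows per file in the script above.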

Issue Severity

Medium: It is a significant difficulty but I can work around it.
