Need help building LVIS locally #5113
Comments
Thanks for reaching out. Unfortunately, we cannot host prepared datasets ourselves, because some datasets have specific licensing terms. As for your issue, Beam usually allows building much bigger datasets, so the problem seems specific to this one dataset and the code in its dataset builder. We'd like to understand why it runs out of memory. Would it be possible for you to run a heap inspection [1] before it crashes and report your findings?

[1] Python has a few internals or libraries for this:
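For example, a minimal sketch using the standard library's `tracemalloc` (the allocation loop below is just a stand-in for the dataset build, which is where the real allocations would happen):

```python
import tracemalloc

# Start tracking Python-level memory allocations.
tracemalloc.start()

# Stand-in workload; in practice, run the failing dataset build here.
blobs = [bytes(10_000) for _ in range(1_000)]

# Snapshot current allocations and show the biggest allocation sites.
snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics("lineno")
for stat in top[:5]:
    print(stat)  # file:line, total size, allocation count
```

Running this shortly before the crash (or periodically during the build) should point to the lines responsible for the bulk of the memory.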
@marcenacp, is the dataset hosted on cloud storage such as AWS S3 or Google Cloud Storage? If it is, we can download it directly from there. Also, since you're running into memory limitations, we might consider using an EC2 instance with more memory, such as r6g.4xlarge or larger, to accommodate the dataset processing.
I have the same problem
Hello,
The build was killed after running for 81 minutes.
I don't have any problems with the coco dataset.
@phamnhuvu-dev Okay, but I want to replicate the results on the LVIS dataset specifically. Is there any other workaround to bypass this issue?
@rishabh-akridata I use the COCO dataset instead of the LVIS dataset.
It would be super helpful if we could download a pre-built LVIS val dataset in TFDS format. Does anyone have links for it?
I tried increasing the number of workers on my machine, which has around 256 cores and 200 GB of memory, but I am still not able to build the TFDS dataset for the LVIS val split. Can you please guide me? I need this TFDS dataset of the LVIS val split.
[Attachment: Screen.Recording.2024-03-17.at.13.20.37.mov]
@marcenacp There is a problem at the extraction step.
What I need help with / What I was wondering
I need help downloading the LVIS dataset to my EC2 instance.
What I've tried so far
First, I copied the changes from #5094.
Then, I tried using the SDK to `download_and_prepare` the dataset as follows. I also tried adding more parameters for the DirectRunner.
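A sketch of the kind of call I mean; the builder name `lvis` and the DirectRunner flags here are approximate, not the exact parameters from my run:

```python
# DirectRunner flags I experimented with (values approximate).
BEAM_FLAGS = [
    "--direct_num_workers=4",                  # parallel workers for the DirectRunner
    "--direct_running_mode=multi_processing",  # separate processes instead of threads
]

def build_lvis(data_dir="~/tensorflow_datasets"):
    # Imports kept local so the flag list above can be inspected even
    # without tensorflow_datasets / apache_beam installed.
    import apache_beam as beam
    import tensorflow_datasets as tfds

    beam_options = beam.options.pipeline_options.PipelineOptions(flags=BEAM_FLAGS)
    builder = tfds.builder("lvis", data_dir=data_dir)
    builder.download_and_prepare(
        download_config=tfds.download.DownloadConfig(beam_options=beam_options)
    )
```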
After around 10 minutes I can see 4 CPUs at near 100% utilization, so I think the builder is working. It runs for a while, 30 minutes to a couple of hours depending on how many workers I specify, then either hits an error or runs out of memory and gets killed. If I remember correctly, this dataset is about 25 GB in size. My machine has 64 GB of RAM.
It would be nice if...
It would be most convenient for me if I could just download an already built version of the dataset so I could avoid needing to build it myself. I don't really understand what goes on during the build. I just need this dataset locally in TFDS format so I can train a model that's been written to consume this dataset in this format. I'd rather not have to learn about Apache Beam and set up Google Cloud infrastructure just to get a 25 GB dataset.
If that's not possible, then it would be nice if I could build the LVIS dataset locally more easily.
Environment information
(if applicable)