
Very large files fail to parse #922

Open
apkrieg opened this issue Apr 11, 2024 · 5 comments
Comments

apkrieg commented Apr 11, 2024

Describe the bug

I'm currently trying to parse a very large YAML file, approximately 4.8 GB, and it's failing.

UPDATE: I was able to parse the 4.8 GB file by changing the type of Index from int to long on Mark, Cursor, and SimpleKey.

To Reproduce

Try to parse a YAML file larger than ~2.2 GB.
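The ~2.2 GB threshold lines up with the maximum value of a signed 32-bit integer. A minimal sketch (plain Python, not YamlDotNet code) of why an `int`-typed `Index` stops working past that point:

```python
# A signed 32-bit index can address at most 2,147,483,647 bytes
# (about 2.1 GB). One byte past that, the value wraps negative.
INT32_MAX = 2**31 - 1           # 2,147,483,647
offset = INT32_MAX + 1          # one byte past the limit

# Simulate 32-bit two's-complement wraparound with masking:
wrapped = ((offset + 2**31) % 2**32) - 2**31
print(INT32_MAX)   # 2147483647
print(wrapped)     # -2147483648  (a negative file offset)
```

A negative offset then breaks anything downstream that assumes positions only grow, which is consistent with parsing failing rather than reporting a clean error.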

apkrieg (Author) commented Apr 11, 2024

If this is something that would be accepted in a PR, I'd love to contribute, but I believe it would be a breaking change for public types like Mark.

apkrieg (Author) commented Apr 12, 2024

I plan to make a PR for this sometime next week. If this change would not be accepted because it could be breaking, please close this so I don't waste the time. Cheers!

EdwardCooke (Collaborator) commented
Initially I think converting ints to longs where needed would be fine. I can't really see much of a downside other than a slightly increased memory footprint (4 bytes vs. 8). It would be interesting to compare the numbers for your use case with that massive file.
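The 4-vs-8-byte difference can be shown directly. A small sketch (Python's `struct` sizes stand in for the C# `int`/`long` fields on Mark, Cursor, and SimpleKey):

```python
import struct

# A 32-bit int occupies 4 bytes and a 64-bit long occupies 8,
# so each widened index field costs 4 extra bytes per instance.
int32_size = struct.calcsize('i')   # C int: 4 bytes
int64_size = struct.calcsize('q')   # C long long: 8 bytes
print(int32_size, int64_size)       # 4 8
print(int64_size - int32_size)      # 4 extra bytes per field
```

Since only a handful of structs carry these fields, the absolute overhead per token should be small; measuring it against the real file is still the honest way to settle it.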

EdwardCooke (Collaborator) commented
If you don't have time to work on this, then I may be able to find time. Do you have a download link for that YAML file you could share? I'd like to compare apples to apples when doing this work.

lahma (Contributor) commented May 11, 2024

It sounds quite natural that an int can't hold an index large enough when parsing such a huge file. Unfortunately, changing ints to longs will have a negative performance impact for smaller files. I'm not entirely sure such big files ought to be supported without being split into logical parts; many parsers do not expect input files to be that large.
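One way to split such a file into logical parts is to treat it as a multi-document YAML stream and parse each document separately. A naive sketch (it assumes no `---` lines occur inside block scalars or quoted strings, which a real splitter would need to handle):

```python
def split_yaml_documents(text):
    """Split a multi-document YAML stream on '---' separator lines
    so each document can be parsed independently with 32-bit offsets.
    Naive sketch: ignores '---' inside block scalars and quotes."""
    docs, current = [], []
    for line in text.splitlines():
        if line.strip() == "---":
            if current:
                docs.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if current:
        docs.append("\n".join(current))
    return docs

stream = "a: 1\n---\nb: 2\n---\nc: 3\n"
print(split_yaml_documents(stream))  # ['a: 1', 'b: 2', 'c: 3']
```

This sidesteps the overflow only when the input is already a document stream; a single 4.8 GB document still needs the wider index.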

Development: no branches or pull requests · 3 participants