dsq --schema missing array in 11GB file #87

Open
mccorkle opened this issue Jul 21, 2022 · 2 comments
Labels
bug Something isn't working

Comments

@mccorkle

Describe the bug and expected behavior

In my testing with large datasets, at least one array of objects is not reported by --schema. The array begins on line 1,326,612,715 of the 1,495,055,188 lines in the 11GB file.

Is it possible that schema only reviews the first X lines or bytes of a file? If so, is there any way that I can override that?

Reproduction steps
With an 11GB (or larger) file:
dsq --schema --pretty LARGE_FILE.json

Versions

  • OS: Ubuntu 22.04 LTS, AMD EPYC 7R32
  • Shell: bash
  • dsq version: dsq 0.20.2 from apt
@mccorkle mccorkle added the bug Something isn't working label Jul 21, 2022
@eatonphil
Member

Hey! Thanks for the report. Yeah, datastation/dsq samples the input to get reasonable performance. Maybe it makes sense to take a larger sample for a larger file, but then performance is going to get much worse. Overall, I don't yet have a great strategy for dealing with very large files.
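To make the trade-off concrete, here is a minimal sketch of prefix sampling in Python. It is not dsq's actual implementation. The function name `infer_keys`, the `sample_size` parameter, and the example records are all hypothetical, chosen only to show how a field that first appears late in the data can fall outside the sample.

```python
def infer_keys(records, sample_size=None):
    """Return the union of top-level keys seen in the first
    `sample_size` records (all records if sample_size is None).

    Hypothetical helper for illustration; dsq's real sampling
    strategy may differ.
    """
    keys = set()
    for i, record in enumerate(records):
        if sample_size is not None and i >= sample_size:
            break  # stop reading once the sample is full
        keys.update(record)
    return keys


# 1000 uniform records, then one record that introduces an array field.
records = [{"id": n} for n in range(1000)]
records.append({"id": 1000, "events": [{"type": "login"}]})

sampled = infer_keys(records, sample_size=100)  # fast, but sees only a prefix
full = infer_keys(records)                      # complete, but reads everything
```

In this sketch `sampled` never contains `events`, while `full` does, at the cost of scanning every record.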

@mccorkle
Author

Before I discovered Datastation, I had imagined building my own tool that would stream-read the file and, on seeing an array, read only the first 3 of the array's children into memory, then count but discard all other objects in the array until it captured the last 3.

The flaw in my plan was that if an array child didn't conform to the structure of the first and last 3 in the array, my report would not include it in the schema. It would, however, have found the schema element that datastation/dsq is missing.

Perhaps a hybrid of your approach and mine, activated by an --array_depth=3 argument?
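The head-and-tail idea described above can be sketched roughly as follows, in Python. This is only an illustration under stated assumptions: `sample_head_tail` and its `depth` parameter are hypothetical names (with `depth` standing in for the proposed `--array_depth` value, which is not an existing dsq flag), and the array's children are assumed to arrive one at a time from some streaming parser rather than from a fully loaded file.

```python
from collections import deque
from itertools import islice


def sample_head_tail(children, depth=3):
    """Keep the first and last `depth` children of a streamed array,
    counting but discarding everything in between.

    Hypothetical helper; memory use is bounded by 2 * depth items.
    """
    it = iter(children)
    head = list(islice(it, depth))   # first `depth` children
    tail = deque(maxlen=depth)       # rolling window of the most recent children
    count = len(head)
    for child in it:
        tail.append(child)           # older entries fall out automatically
        count += 1
    return head, list(tail), count


# A generator stands in for a streamed array of 1000 objects.
# Child 500 carries an extra field that the head and tail never see.
children = (
    {"id": n, "extra": True} if n == 500 else {"id": n}
    for n in range(1000)
)

head, tail, count = sample_head_tail(children, depth=3)
keys = set()
for child in head + tail:
    keys.update(child)
```

Here `count` still reflects all 1000 children and the array itself is detected, but `keys` omits `extra`, which is exactly the flaw noted above: a non-conforming child in the middle of the array goes unreported.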


2 participants