Fail to import large CSV, no documentation or reason #901

Open · 3 tasks · fulldecent opened this issue Mar 18, 2024 · 1 comment
Comments

@fulldecent

I am using Grist as recommended, with the omnibus setup.

Importing a small CSV file succeeded:

[Screenshot: successful import of the small CSV]

However, importing a large CSV file failed.

[Screenshot: failed import of the large CSV]

The file I need to load is 200MB, with 30 columns and 400,000 rows.


Work plan

  • Update the error message to link to this issue and/or a relevant documentation page
  • Document the file size limits for import (if any) and any other exact, objective requirements on input files, and link to these specifications from the error message and/or documentation page
  • Update the error message to link to log files, or show other specific information about why the file was not imported
@fulldecent fulldecent changed the title Fail to import large CSV Fail to import large CSV, no documentation or reason Mar 18, 2024
@gabriel-v

gabriel-v commented Mar 20, 2024

Wanted to say the same thing. I'm running the Docker container capped at 10GB of RAM.

Here is a 2GB CSV file download:

https://catalog.data.gov/dataset/crimes-2001-to-present/resource/31b027d7-b633-4e82-ad2e-cfa5caaf5837

https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD


Tried both with and without sandboxing.

With the sandbox, I get this in the logs pretty quickly:

```
2024-03-20 03:44:43.910 - warn: Sandbox unexpectedly exited with code 1 signal null sandboxPid=684, flavor=gvisor, command=undefined, entryPoint=(default), plugin=builtIn/core, docId=vdvs6jqzwmuzosEcHdoJoY
2024-03-20 03:44:43.927 - warn: Error during api call to /workspaces/2/import: Failed to parse CSV file.
Error: [Sandbox] PipeFromSandbox is closed:     raise Exception('gvisor runsc problem: ' + json.dumps(command))
```

I guess there's a low implicit memory limit for the sandbox?


With no sandbox, I get this:


```
2024-03-20 03:50:04.240 - warn: Sandbox unexpectedly exited with code null signal SIGKILL sandboxPid=23, flavor=unsandboxed, command=undefined, entryPoint=(default), plugin=builtIn/core, docId=vdvs6jqzwmuzosEcHdoJoY
```

and in dmesg:

```
[26021.168950] Memory cgroup out of memory: Killed process 137254 (python3.11) total-vm:10179908kB, anon-rss:10018296kB, file-rss:6912kB, shmem-rss:0kB, UID:0 pgtables:19992kB oom_score_adj:0
```

So it's using more than 10GB of RAM to parse the 2GB CSV file. Let's give it 20GB...

[Screenshot from 2024-03-20 06-02-13]

aaaand, boom

```
[26749.723969] Memory cgroup out of memory: Killed process 141164 (python3.11) total-vm:20545332kB, anon-rss:20259880kB, file-rss:7040kB, shmem-rss:0kB, UID:0 pgtables:40788kB oom_score_adj:0
```

[Screenshot from 2024-03-20 06-02-50]

Hey, it guessed the headers now...


Let's give it 40GB

[Screenshot from 2024-03-20 06-23-32]

No more OOM, that's nice. The container process seems to be using at most 21GB, even though the container group itself peaks at 31GB.

Still, after a couple of minutes the UI breaks down:

[Screenshot from 2024-03-20 06-08-55]

And there's no data in the persist folder, which is probably why the recovery view is empty:

[Screenshot from 2024-03-20 06-26-36]


Ok, let's cut down the file size by truncating to the first 600k rows (180MB).

[Screenshot from 2024-03-20 06-34-45]

RAM usage tops out at 10GB (roughly a 50x increase over the file size).

Seems to be fine so far, but then it shows this screen with no data; clicking the "new table" button on the left starts a spinner.

[Screenshot from 2024-03-20 06-35-58]

....

after 15min

[Screenshot from 2024-03-20 06-41-40]

GREAT SUCCESS! Still, clicking "ok" jumps into another spinner... I guess I'll wait 15 more minutes.

[Screenshot from 2024-03-20 06-46-25]

Victory! We have almost a million Chicago crimes now.

Here's the final RAM usage for the 180MB CSV file:

[Screenshot from 2024-03-20 06-49-30]

Uploading the 180MB CSV takes 10GB of server RAM. Opening the doc then takes around 4GB of server RAM, regardless of the number of open tabs for the same user.

Finally, the SQLite file is 2.1GB (roughly 12x amplification of the 180MB input):

```
➜  persist du -hd1 docs/v*
2.1G	docs/vdvs6jqzwmuzosEcHdoJoY.grist
```

After the upload finished, I restarted the container with sandboxing enabled, and reading & searching work (while the server container still takes 2-4GB of RAM). If the sandbox has a low RAM limit, I guess these 2-4GB are used by the Node.js server part?


Two things are clear from this:

  • Grist loads the CSV file completely into RAM at upload time
  • Grist then processes it in memory with over 50x amplification
    • uploading the full crimes file would require >100GB of server RAM available to the Grist container

Both of these can be fixed, but it seems this was designed to fit everything into RAM...


Related issues and comments

> assuming no single document becomes too large

Yes, but the crimes are arriving at about 4MB of CSV per day.


@fulldecent turn off sandboxing and give your container at least 16GB of RAM, and it might work like above. Then turn sandboxing back on.

Questions for the devs:

  • Can we customize the sandbox RAM limit?
  • Is there a plan to use streaming parsers, e.g. pyexcel.iget_array() / npm csv-stream, to accept files bigger than the container's RAM? The upload could be cached on the filesystem in the persist folder using tempfile. (A sketch of the idea follows this list.)
  • Can something be done about the large RAM requirement for opening already-uploaded files? It's 2x the size of the entire SQLite file.
  • Is there a configurable upload file size limit (GRIST_MAX_UPLOAD_ATTACHMENT_MB / GRIST_MAX_UPLOAD_IMPORT_MB)? If not, I guess we can put nginx in front with a 100MB upload limit (client_max_body_size 100m), to maybe prevent the container from using more than 10GB of RAM.
  • Would you add the Chicago file to your CI stress testing? These problems tend to come back unless tested for; it's really easy to regress into cloning tens of GBs of data in a single a = list(b) line of code.
