Fail to import large CSV, no documentation or reason #901

Open · 3 tasks · fulldecent opened this issue Mar 18, 2024 · 1 comment
Comments

@fulldecent

I am using Grist as recommended, with the omnibus setup.

Importing a small CSV file succeeded:

[Screenshot: successful import of the small CSV]

However, importing a large CSV file failed.

[Screenshot: failed import of the large CSV]

The file I need to load is 200MB, with 30 columns and 400,000 rows.


Work plan

  • Update the error message to link to this issue and/or a relevant documentation page
  • Document the file size limits for import (if any) and any other exact, objective requirements on input files, and link to these specifications from the error message and/or documentation page
  • Update the error message to link to log files, or show other specific information about why the file was not imported
@fulldecent fulldecent changed the title Fail to import large CSV Fail to import large CSV, no documentation or reason Mar 18, 2024
@gabriel-v

gabriel-v commented Mar 20, 2024

Wanted to say the same thing. I'm running the Docker container capped at 10GB of RAM.

Here is a 2GB CSV file download:

https://catalog.data.gov/dataset/crimes-2001-to-present/resource/31b027d7-b633-4e82-ad2e-cfa5caaf5837

https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD


Tried both with and without sandboxing.

With the sandbox, I get this in the logs pretty quickly:

```
2024-03-20 03:44:43.910 - warn: Sandbox unexpectedly exited with code 1 signal null sandboxPid=684, flavor=gvisor, command=undefined, entryPoint=(default), plugin=builtIn/core, docId=vdvs6jqzwmuzosEcHdoJoY
2024-03-20 03:44:43.927 - warn: Error during api call to /workspaces/2/import: Failed to parse CSV file.
Error: [Sandbox] PipeFromSandbox is closed:     raise Exception('gvisor runsc problem: ' + json.dumps(command))
```

I guess there's a low implicit memory limit for the sandbox?


With no sandbox, I get this:


```
2024-03-20 03:50:04.240 - warn: Sandbox unexpectedly exited with code null signal SIGKILL sandboxPid=23, flavor=unsandboxed, command=undefined, entryPoint=(default), plugin=builtIn/core, docId=vdvs6jqzwmuzosEcHdoJoY
```

and in dmesg:

```
[26021.168950] Memory cgroup out of memory: Killed process 137254 (python3.11) total-vm:10179908kB, anon-rss:10018296kB, file-rss:6912kB, shmem-rss:0kB, UID:0 pgtables:19992kB oom_score_adj:0
```

So it's using more than 10GB of RAM to parse the 2GB CSV file. Let's give it 20GB...

[Screenshot from 2024-03-20 06-02-13]

aaaand, boom

```
[26749.723969] Memory cgroup out of memory: Killed process 141164 (python3.11) total-vm:20545332kB, anon-rss:20259880kB, file-rss:7040kB, shmem-rss:0kB, UID:0 pgtables:40788kB oom_score_adj:0
```

[Screenshot from 2024-03-20 06-02-50]

Hey, it guessed the headers now...


Let's give it 40GB

[Screenshot from 2024-03-20 06-23-32]

No more OOM, that's nice. The container process seems to be using at most 21GB, even though the container group itself peaks at 31GB.

Still, after a couple of minutes the UI breaks down:

[Screenshot from 2024-03-20 06-08-55]

And there's no data in the persist folder, which is probably why the recovery view is empty:

[Screenshot from 2024-03-20 06-26-36]


Ok, let's cut down the file size by truncating to the first 600k rows (180MB).

[Screenshot from 2024-03-20 06-34-45]

RAM usage tops out at 10GB (roughly a 50x increase over the file size).

Seems to be fine so far, but then it shows this screen with no data; clicking the "new table" button on the left starts a spinner.

[Screenshot from 2024-03-20 06-35-58]

....

after 15min

[Screenshot from 2024-03-20 06-41-40]

GREAT SUCCESS! Still, clicking "ok" jumps into another spinner... I guess I'll wait 15 more minutes.

[Screenshot from 2024-03-20 06-46-25]

Victory! We have almost a million Chicago crimes now.

Here's the final RAM usage for the 180MB CSV file:

[Screenshot from 2024-03-20 06-49-30]

Uploading the 180MB CSV takes 10GB of server RAM. Opening the doc then takes around 4GB of server RAM, regardless of the number of open tabs for the same user.

Finally, the SQLite file is 2.1GB (roughly 12x amplification of the 180MB input):

```
➜  persist du -hd1 docs/v*
2.1G	docs/vdvs6jqzwmuzosEcHdoJoY.grist
```

After the upload finished, I restarted the container with sandboxing enabled, and reading & searching work (while the server container still takes 2-4GB of RAM). If the sandbox has a low RAM limit, I guess these 2-4GB are used by the Node.js server part?


Two things are clear from this:

  • Grist loads the CSV file completely into RAM at upload time
  • Grist then processes it in memory with over 50x amplification
    • uploading the full crimes file would require >100GB of server RAM available to the Grist container

Both of these can be fixed, but it seems this was designed to fit everything into RAM...


Related issues and comments

> assuming no single document becomes too large

Yes, but the crimes are arriving at about 4MB of CSV per day.


@fulldecent turn off sandboxing and give your container at least 16GB of RAM, and it might work like above. Then turn sandboxing back on.

Questions for the devs:

  • Can we customize the sandbox RAM limit?
  • Is there a plan to use streaming parsers, e.g. pyexcel.iget_array() / npm csv-stream, to accept files bigger than the container's RAM? The upload could be cached on the filesystem in the persist folder using tempfile. (A sketch of the idea follows this list.)
  • Can something be done about the large RAM requirement for opening already-uploaded files? It's 2x the size of the entire SQLite file.
  • Is there a configurable upload file size limit (GRIST_MAX_UPLOAD_ATTACHMENT_MB / GRIST_MAX_UPLOAD_IMPORT_MB)? If not, I guess we can put nginx in front with a 100MB upload limit (client_max_body_size 100m), to maybe prevent the container from using more than 10GB of RAM.
  • Would you add the Chicago file to your CI stress testing? These problems tend to come back unless tested for; it's really easy to regress into cloning tens of GBs of data in a single a = list(b) line of code.
