Opening more than 256 files or file descriptors at once? #63

Open
brainstorm opened this issue May 23, 2022 · 3 comments

Comments

@brainstorm

brainstorm commented May 23, 2022

As of macOS Monterey 12.4, the default soft limit on open file descriptors appears to be 256 (a bit low compared with the Linux default, which is 1024, IIRC). Regardless, dsq fails on a single small file until the limit is raised:

(base) rvalls@m1 out % dsq SBJ02239_PRJ221187.parquet
open SBJ02239_PRJ221187.parquet: too many open files
(base) rvalls@m1 out % dsq SBJ02239_PRJ221187.parquet "SELECT * FROM {} LIMIT 10"
open SBJ02239_PRJ221187.parquet: too many open files
(base) rvalls@m1 out % ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8176
-c: core file size (blocks)         0
-v: address space (kbytes)          unlimited
-l: locked-in-memory size (kbytes)  unlimited
-u: processes                       2666
-n: file descriptors                256
(base) rvalls@m1 out % ulimit -n 8912
(base) rvalls@m1 out % dsq SBJ02239_PRJ221187.parquet "SELECT * FROM {} LIMIT 10"
(cannot show the actual contents of the file, but it works after `ulimit`)

Why does dsq need so many (intermediate?) file descriptors to open an ordinary .parquet file of roughly 200 KB?
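
For reference, the `ulimit -n` workaround above can also be applied from inside a Go program. This is a minimal sketch and not something dsq is known to do: `raiseNoFile` is a hypothetical helper name, the value 8192 is arbitrary, and the code assumes Linux or macOS, where the fields of `syscall.Rlimit` are `uint64`.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// raiseNoFile (hypothetical helper) raises the soft RLIMIT_NOFILE to at
// least want, capped at the hard limit. It returns the soft limit that is
// in effect afterwards.
func raiseNoFile(want uint64) (uint64, error) {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, err
	}
	if lim.Cur >= want {
		return lim.Cur, nil // already high enough, nothing to do
	}
	target := want
	if target > lim.Max {
		target = lim.Max // an unprivileged process cannot exceed the hard limit
	}
	lim.Cur = target
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, err
	}
	return lim.Cur, nil
}

func main() {
	got, err := raiseNoFile(8192) // 8192 is an arbitrary example value
	if err != nil {
		fmt.Fprintln(os.Stderr, "could not raise limit:", err)
		os.Exit(1)
	}
	fmt.Println("soft RLIMIT_NOFILE is now", got)
}
```

Like `ulimit -n`, this only affects the calling process and its children, and it hides the number of descriptors being opened rather than reducing it.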

@eatonphil
Member

Hey, thanks for the report! As you say, Macs do have a relatively low default file descriptor limit compared to Linux.

I'm not sure why the parquet library opens so many files.

Is the parquet file something you can share?

I'm happy to leave this open, but since there's a workaround (adjusting ulimit), I probably won't get around to looking into this for a while.

@brainstorm
Author

brainstorm commented May 25, 2022

Sorry, I can't share the contents of that specific file, but macOS Instruments tells me that dsq performs ~343 open() syscalls, most of them on the same input file, thus creating a ton of file descriptors. Why can't the same fd be shared?

[Screenshot: Screen Shot 2022-05-25 at 3 17 49 pm]

The file contains around 1500 columns in its original .tsv form. I'm not sure I'll have time to put together a reproducer with non-private data, but I hope that serves as a hint about what might be going wrong. Adjusting ulimit seems like a poor workaround for what appears to be an underlying library issue.

@eatonphil
Member

Yeah, that doesn't seem great, but I'm not familiar with Parquet internals or with the internals of the particular library that datastation/dsq uses.

If you'd like, you can open a bug report with https://github.com/xitongsys/parquet-go right now and ask them.

Once I get around to finding a case that reproduces this behavior, I'll open a bug ticket myself (if you haven't by then).
