job-exec: possible scaling issue managing many job shells #5919
Hmm, it seems like the output check watcher is always running. We might want to see what the profile looks like with the following change (all subprocess unit tests pass with this):

```diff
diff --git a/src/common/libsubprocess/remote.c b/src/common/libsubprocess/remote.c
index 3dd9a3ca1..2e009c994 100644
--- a/src/common/libsubprocess/remote.c
+++ b/src/common/libsubprocess/remote.c
@@ -56,7 +56,6 @@ static void start_channel_watchers (flux_subprocess_t *p)
         flux_watcher_start (c->in_prep_w);
         flux_watcher_start (c->in_check_w);
         flux_watcher_start (c->out_prep_w);
-        flux_watcher_start (c->out_check_w);
         c = zhash_next (p->channels);
     }
 }
@@ -303,8 +302,10 @@ static void remote_out_prep_cb (flux_reactor_t *r,
      * reactors are closed */
     if ((c->line_buffered && fbuf_has_line (c->read_buffer))
         || (!c->line_buffered && fbuf_bytes (c->read_buffer) > 0)
-        || (c->read_eof_received && !c->eof_sent_to_caller))
+        || (c->read_eof_received && !c->eof_sent_to_caller)) {
         flux_watcher_start (c->out_idle_w);
+        flux_watcher_start (c->out_check_w);
+    }
 }

 static void remote_out_check_cb (flux_reactor_t *r,
@@ -315,6 +316,7 @@ static void remote_out_check_cb (flux_reactor_t *r,
     struct subprocess_channel *c = arg;

     flux_watcher_stop (c->out_idle_w);
+    flux_watcher_stop (c->out_check_w);

     if ((c->line_buffered
         && (fbuf_has_line (c->read_buffer)
@@ -336,7 +338,6 @@ static void remote_out_check_cb (flux_reactor_t *r,
      * reactors are closed */
     if (c->eof_sent_to_caller) {
         flux_watcher_stop (c->out_prep_w);
-        flux_watcher_stop (c->out_check_w);

         /* close input side as well */
         flux_watcher_stop (c->in_prep_w);
```

@chu11 I think you are the most familiar with this code. Does this analysis sound right? Edit: I am referring to this code.
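For readers less familiar with this corner of libev-style reactors: a started prep or check watcher runs its callback on every loop iteration, so with thousands of remote subprocesses the per-iteration cost is O(watchers) even when nothing has data. Below is a minimal sketch of the pattern (the `struct ctx`, `work_pending`, and the buffer-draining comment are illustrative, not the remote.c names); the patch above applies the same idea by starting `out_check_w` only when the prep callback sees pending output.

```c
#include <stdbool.h>
#include <flux/core.h>

struct ctx {
    flux_watcher_t *prep_w;
    flux_watcher_t *check_w;
    flux_watcher_t *idle_w;
    bool work_pending;        /* e.g. bytes sitting in a buffer */
};

/* Prep runs just before the reactor blocks in poll(2).  If work is
 * pending, start the idle watcher (so the loop won't sleep) and the
 * check watcher (so the work is handled this iteration). */
static void prep_cb (flux_reactor_t *r, flux_watcher_t *w,
                     int revents, void *arg)
{
    struct ctx *ctx = arg;
    if (ctx->work_pending) {
        flux_watcher_start (ctx->idle_w);
        flux_watcher_start (ctx->check_w);
    }
}

/* Check runs just after poll(2) returns.  Stop the helper watchers so
 * an idle channel costs nothing on later iterations, then do the work. */
static void check_cb (flux_reactor_t *r, flux_watcher_t *w,
                      int revents, void *arg)
{
    struct ctx *ctx = arg;
    flux_watcher_stop (ctx->idle_w);
    flux_watcher_stop (ctx->check_w);
    /* ... drain the buffer, invoke user callbacks, etc. ... */
    ctx->work_pending = false;
}
```

Note that the prep watcher itself still fires every iteration; that residual per-iteration cost is what the later discussion about disabling prep/check entirely is chasing.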
@garlick yeah, that analysis looks right to me. I think we could also do the "in" check watcher as well. I remember cbuf has a big chunk of checks (flux-core/src/common/liblsd/cbuf.c, line 1760 in bff958b) — that's another thing we could tweak if we wanted to.
Looks like cbuf compiles that out if NDEBUG is set, along with other checks. Maybe we should be building it with that set. Did you want to propose a PR for all this?
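For reference, this is just standard `<assert.h>` behavior: defining NDEBUG turns `assert()` into a no-op at compile time, so consistency checks wrapped in it (as cbuf's are) are never evaluated. A tiny self-contained demonstration:

```c
#include <assert.h>
#include <stdio.h>

static int expensive_check_ran = 0;

static int expensive_check (void)
{
    expensive_check_ran = 1;
    return 1;
}

int main (void)
{
    /* With -DNDEBUG, assert() expands to ((void)0), so
     * expensive_check() is never called and this prints 0;
     * without it, the check runs and this prints 1. */
    assert (expensive_check ());
    printf ("check ran: %d\n", expensive_check_ran);
    return 0;
}
```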
sure, lemme give it a shot
another thought: since the remote subprocess already buffers data before sending it, what are we gaining by doing it again on the local side? Could we just have the message callback directly invoke the output callback(s)?
I retried with @garlick's patch above, which didn't seem to improve things. The broker is still spending a majority of its time in these three watcher functions. And I'm fairly certain I recompiled, but let me check again after a clean rebuild.
Ah, this seems to have gotten better after the rebuild, I guess.
Hmmmm. It seems that we don't pass LINE_BUFFERED to the "server side" subprocess, so the raw data coming back still needs to be buffered for handling line buffering. But perhaps there is an optimization somewhere in here.
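To make the buffering requirement concrete: with line buffering, a chunk arriving off the wire may end mid-line, and the tail has to be held locally until the next chunk completes it. A minimal sketch (hypothetical helper, not the fbuf code):

```c
#include <string.h>

/* Return the length of the first complete line (including '\n') in
 * buf, or 0 if no full line has arrived yet -- in which case the
 * partial data must stay buffered until the next chunk arrives. */
static size_t line_length (const char *buf, size_t len)
{
    const char *nl = memchr (buf, '\n', len);
    return nl ? (size_t)(nl - buf) + 1 : 0;
}
```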
It's part of the
I was just thinking something similar.
I'm really curious whether things would be a lot better if
it certainly looks like
I looked, and it certainly would be part of the problem. Edit: in
Edit2: I'm stupid ....
I can quickly try this since I have the test environment all set up already.
Compiling cbuf with NDEBUG now.
That causes compilation errors:
But I'll see if I can fix that as well (I actually think we have this fixed in our patches for conda).
Eh, I just ended up disabling those errors/warnings:

```diff
diff --git a/src/common/liblsd/Makefile.am b/src/common/liblsd/Makefile.am
index 3670604ae..009bf7aa6 100644
--- a/src/common/liblsd/Makefile.am
+++ b/src/common/liblsd/Makefile.am
@@ -1,13 +1,16 @@
 AM_CFLAGS = \
 	$(WARNING_CFLAGS) \
 	$(CODE_COVERAGE_CFLAGS) \
-	-Wno-parentheses -Wno-error=parentheses
+	-Wno-parentheses -Wno-error=parentheses \
+	-Wno-error=unused-but-set-variable \
+	-Wno-unused-but-set-variable

 AM_LDFLAGS = \
 	$(CODE_COVERAGE_LIBS)

 AM_CPPFLAGS = \
-	-DWITH_PTHREADS
+	-DWITH_PTHREADS \
+	-DNDEBUG

 noinst_LTLIBRARIES = liblsd.la
```
Ok, that removed the cbuf checks from the profile.
disabling output prep/check was pretty easy and passed all tests, although I wonder if it works as a "fallout" and not due to correct implementation (more below). the "input" side ("input" being writing to the subprocess) was a different story. it's obviously been a long time since i wrote the flux buffers, but I think the main issue is that the write buffer is filled "outside of the reactor" in flux_subprocess_write(). I think the output side happens to work b/c the read buffer is filled within a message callback, inside the reactor.
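If I'm reading that analysis right, the asymmetry looks roughly like this (sketch, not the actual remote.c handler): data for the read side arrives in a message callback that already runs inside the reactor, so that callback can kick the check watcher on demand — whereas a write buffer filled from plain API calls outside any reactor callback has no such natural hook, which is what the prep watcher provides.

```c
#include <flux/core.h>

struct channel {
    /* illustrative stand-in for remote.c's subprocess_channel */
    flux_watcher_t *out_check_w;
};

/* Sketch: the response handler runs inside the reactor, so after
 * appending payload to the read buffer it can schedule delivery
 * directly instead of leaving a check watcher running on every
 * loop iteration. */
static void output_response_cb (flux_t *h, flux_msg_handler_t *mh,
                                const flux_msg_t *msg, void *arg)
{
    struct channel *c = arg;

    /* ... decode msg and append data to the channel's read buffer ... */
    flux_watcher_start (c->out_check_w);  /* data pending: deliver it */
}
```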
Hmmmm. Interesting. Nothing strange in the logs? I'm not sure what the job is that you're running, but it makes me think that the buffers are not emptying, thus spinning.
The workload is multiple 128N/128p jobs on a size=1536 instance with a flat topology. When getting up to about 64 simultaneous jobs I started seeing this behavior. (That's 64*128 = 8192 active job shells, all launched from rank 0)
per offline chat with @garlick, the general belief is that something other than "libsubprocess is slow" is causing the primary performance issue (a buffer not emptying, something causing the reactor to constantly exit the idle phase, etc.). That said, we certainly want to speed things up in general.
FYI - the solution proposed by @garlick (patch in this comment) worked well.
@chu11, FYI I added this issue to our milestone for next week's release. LMK if that's not feasible. Seems like at least we could get the performance improvements in.
I think it should be doable; my analysis a few days ago suggested we can remove the additional buffering and call the output callback directly.
Doh! Once I started looking at making actual changes, it became obvious. We buffer b/c it's part of the API. The output callback is not passed the output data & length; the user calls flux_subprocess_read() from the callback to retrieve it. So I don't think this is going to be an optimization option.
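Concretely, the consumer pattern the API bakes in looks like this (sketch; I'm assuming a `flux_subprocess_read()` that returns the buffered length and hands back a NUL-terminated buffer — check subprocess.h for the exact signature):

```c
#include <stdio.h>
#include <flux/core.h>

/* The output op receives only (p, stream) -- no data pointer or
 * length -- so the library must have already copied the data into an
 * internal buffer for this read call to consume. */
static void output_cb (flux_subprocess_t *p, const char *stream)
{
    const char *buf;
    int len;

    if ((len = flux_subprocess_read (p, stream, &buf)) > 0)
        fwrite (buf, 1, len, stderr);
}
```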
Darn, I didn't notice that either. Well, there are still the other optimizations and the mystery of the reactor loop that wouldn't stop.
Could we add a new function to disable the existing callbacks and register a new one that is passed output as soon as it is ready, including the data and length? It is kind of ugly but could be used to work around this performance issue in the job-exec module.
Another way to do it would be to add a flag that could be passed to flux_rexec().
That's a good idea too, though we'd still need a different callback prototype than the current one:

```c
typedef void (*flux_subprocess_output_f) (flux_subprocess_t *p,
                                          const char *stream);
```

The original subprocess API was modeled after some existing event-driven interfaces that just pass back an event and let the callback handle it. A bit of an unfortunate choice in hindsight.
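For illustration, a data-passing variant might look something like this (hypothetical — the name and parameters are mine, not a settled design):

```c
/* Hypothetical prototype: hand the payload to the callback directly,
 * removing the need for a library-side buffer and a separate read call. */
typedef void (*flux_subprocess_output_data_f) (flux_subprocess_t *p,
                                               const char *stream,
                                               const char *data,
                                               int len);
```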
That reminds me - I have a subprocess cleanup branch in which I consolidated the stdout, stderr, and channel output callback ops into one (since the stream name is passed into the callback). Maybe that would be a good thing to do before adding another output type, since otherwise it would require adding three new operations.
ok, I see three paths (some comments were in #5932)
personally, I think option 3 is best (the devil is in the details until I start).
I don't see an option for enabling an alternate output callback function as suggested above, either with or without a flag to disable creation of the buffers in the first place. This would then allow updating only the job-exec module, where we are experiencing the issue.
yeah, we would either A) add another callback option plus the flag, or B) change the output function prototype, with the flags indicating whether data is buffered or passed via the function callback.
Another idea that might simplify things: could we keep all the function prototypes the same but
I like that!
Hmmm, hit a minor non-optimal thing, per the flux_subprocess_read() docs:
"Buffer is guaranteed to be NUL terminated. " means we can't just send over the data from the response payload. We need to copy it into a buffer and add a NUL at the end. This is what is done internally within an I pondered if we should create an alternate new function to return the data from the message directly without a NUL termination, but given how common it is to simply:
that's probably not a good idea. I hate to carry around the buffer and |
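The copy that the NUL-termination guarantee forces looks roughly like this (sketch):

```c
#include <stdlib.h>
#include <string.h>

/* To honor "buffer is guaranteed to be NUL terminated", data arriving
 * in the response payload must be copied so a NUL can be appended;
 * the payload itself cannot be handed back as-is. */
static char *copy_with_nul (const void *data, size_t len)
{
    char *buf = malloc (len + 1);
    if (!buf)
        return NULL;
    memcpy (buf, data, len);
    buf[len] = '\0';
    return buf;
}
```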
question: should we revert PR #5932 (the short-term temp solution) when we add FLUX_SUBPROCESS_FLAGS_LOCAL_UNBUF support? Part of me thinks we should revert it b/c FLUX_SUBPROCESS_FLAGS_LOCAL_UNBUF was supposed to be the "real solution". On the other hand, what was done in #5932 is fine ... I was just going back and forth on it.
You mean to reduce complexity? It seems like it will still be useful for non-LOCAL_UNBUF users.
I had more or less completed the implementation when I realized: the data from the remote process is written out with data & length, so the NUL termination byte isn't even needed. So any copying we do just to add the NUL byte is unnecessary. Soooo .... alternate new idea: the new LOCAL_UNBUF flag returns the data & length directly without NUL termination. This removes the need to create an internal buffer copy to add NUL termination. We create a new function to read the unbuffered data. Edit: Technically, b/c the data coming from the server is passed as a string, I believe the data should be guaranteed to be NUL terminated all the time regardless. But we should probably not assume that in general.
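A sketch of what the unbuffered consumer side could look like (the comment above elides the new function's name, so `flux_subprocess_read_unbuf` below is a placeholder of mine):

```c
#include <stdio.h>
#include <flux/core.h>

/* Hypothetical placeholder for the new unbuffered read call. */
extern int flux_subprocess_read_unbuf (flux_subprocess_t *p,
                                       const char *stream,
                                       const char **datap);

/* With FLUX_SUBPROCESS_FLAGS_LOCAL_UNBUF, the callback consumes the
 * data and length straight from the message payload: no internal
 * buffer, no copy, no NUL-termination guarantee. */
static void output_cb (flux_subprocess_t *p, const char *stream)
{
    const char *data;
    int len;

    if ((len = flux_subprocess_read_unbuf (p, stream, &data)) > 0)
        fwrite (data, 1, len, stdout);   /* length-based, not %s */
}
```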
Posting the results of some profiling on corona:
- w/ flux-core 0.61.2, results seem similar to the original report
- w/ current master, which has the short-term fix from #5932
- w/ the unbuf code
seems like a decent incremental improvement, as the amount of time in job-exec seemed to go down, as did the amount of time spent in libsubprocess. But optimizing the input prep/check would be the next logical step ... although that one is trickier, and we've gotta decide whether the improvement there is worthwhile given the trickiness. Edit: interesting,
(AFAICT we do not compile with pthreads support ...) So perhaps there's something there too.
I tried to test my branch that reduces the input prep/check side of things but hit what I presume is #5920.
On the one hand, I'm hopeful that means the input prep/check work would actually make this reproducer "pass". On the other hand, it's not entirely clear to me why this appears with my change. My only assumption is that it changes the timing.
I was hitting that error in my testing as well, FWIW.
I temporarily commented out the shell oom watcher to avoid inotify, and the reproducer actually runs all sleep jobs to completion!
So maybe all those gajillions of prep-checks were just too much for inotify.
Great! Can you expand the stack trace for that?
Oops, I didn't save that data off. I'll try again another time. From what I remember, there wasn't much under it.
Just profiled my "reduce prep/check of input watcher" and expanded out the calls.
No super obvious bottlenecks. We could certainly try to optimize a few of these to get things to perform incrementally better. Edit: It's worth noting that all the jobs completed successfully, and the input & output prep/check was not started unless needed. Edit2: I should try again AFTER the jobs have started, to see if there is anything "flying around".
I ran another profiling w/ the reduced input prep/check (#6002). However, unlike above, I ran the profiling AFTER all of the jobs had started, as an attempt to see if anything was regularly waking up the reactor. So I tried again, but increased the profiling frequency to the max. Still nothing. Sooo ... it's possible there's something rare that tickles this, or it makes me wonder about the input/output prep/check, or perhaps the sheer quantity of watchers.
Should this issue be closed now? I'm unclear if anything else is required here.
I was running a test instance of size=1536 across 32 real nodes, each with 96 cores. When launching 128N/128p jobs there seemed to be some kind of performance issue in job-exec and jobs were not making progress. Perf report shows a lot of time spent in the libsubprocess watchers:
A `sleep 1` prolog and epilog were configured, but the job-manager did not appear to have high utilization in perf, so maybe that is a red herring. The other thing to note is that the nodes are being heavily oversubscribed in this test, so perhaps that's a contributing factor.