job-exec: possible scaling issue managing many job shells #5919

Closed
grondo opened this issue Apr 29, 2024 · 48 comments

@grondo
Contributor

grondo commented Apr 29, 2024

I was running a test instance of size=1536 across 32 real nodes, each with 96 cores. When launching 128N/128p jobs, there seemed to be some kind of performance issue in job-exec and jobs were not making progress. A perf report shows a lot of time spent in the libsubprocess watchers:

-   96.26%     0.00%  job-exec         libc-2.28.so           [.] __GI___clone (inlined)
     __GI___clone (inlined)
     start_thread
     module_thread
     mod_main
     flux_reactor_run
     ev_run
   - ev_run
      - 84.93% ev_invoke_pending
         - 21.03% remote_out_check_cb
            - 14.38% fbuf_has_line
               - 13.48% cbuf_peek_line
                  - 12.15% cbuf_is_valid
                     - 1.04% cbuf_mutex_is_locked (inlined)
                          1.02% __GI___pthread_mutex_trylock (inlined)
                    0.76% __GI___pthread_mutex_lock (inlined)
              1.22% cbuf_used
              0.98% flux_watcher_stop
         - 17.73% remote_out_prep_cb
            - 14.68% fbuf_has_line
               - 13.93% cbuf_peek_line
                  + 12.53% cbuf_is_valid
                    0.91% __GI___pthread_mutex_lock (inlined)
         - 15.24% remote_in_check_cb
            + 5.03% cbuf_used
              0.73% flux_watcher_stop
         + 10.36% remote_in_prep_cb
           7.62% prepare_cb
           6.98% check_cb
      - 11.32% queue_events (inlined)
           ev_feed_event

A sleep 1 prolog and epilog were configured, but the job-manager did not appear to have high utilization in perf, so maybe that is a red herring.

The other thing to note is that the nodes are being heavily oversubscribed in this test, so perhaps that's a contributing factor.

@garlick
Member

garlick commented Apr 29, 2024

Hmm, it seems like the output prep check watcher is always active, so while the output channel is open, every event loop will check the buffer for data. The prep watcher should be stopped when the buffer is empty and started only when it contains consumable data.

We might want to see what the profile looks like with the following change (all subprocess unit tests pass with this):

diff --git a/src/common/libsubprocess/remote.c b/src/common/libsubprocess/remote.c
index 3dd9a3ca1..2e009c994 100644
--- a/src/common/libsubprocess/remote.c
+++ b/src/common/libsubprocess/remote.c
@@ -56,7 +56,6 @@ static void start_channel_watchers (flux_subprocess_t *p)
         flux_watcher_start (c->in_prep_w);
         flux_watcher_start (c->in_check_w);
         flux_watcher_start (c->out_prep_w);
-        flux_watcher_start (c->out_check_w);
         c = zhash_next (p->channels);
     }
 }
@@ -303,8 +302,10 @@ static void remote_out_prep_cb (flux_reactor_t *r,
      * reactors are closed */
     if ((c->line_buffered && fbuf_has_line (c->read_buffer))
         || (!c->line_buffered && fbuf_bytes (c->read_buffer) > 0)
-        || (c->read_eof_received && !c->eof_sent_to_caller))
+        || (c->read_eof_received && !c->eof_sent_to_caller)) {
         flux_watcher_start (c->out_idle_w);
+        flux_watcher_start (c->out_check_w);
+    }
 }
 
 static void remote_out_check_cb (flux_reactor_t *r,
@@ -315,6 +316,7 @@ static void remote_out_check_cb (flux_reactor_t *r,
     struct subprocess_channel *c = arg;
 
     flux_watcher_stop (c->out_idle_w);
+    flux_watcher_stop (c->out_check_w);
 
     if ((c->line_buffered
          && (fbuf_has_line (c->read_buffer)
@@ -336,7 +338,6 @@ static void remote_out_check_cb (flux_reactor_t *r,
      * reactors are closed */
     if (c->eof_sent_to_caller) {
         flux_watcher_stop (c->out_prep_w);
-        flux_watcher_stop (c->out_check_w);
 
         /* close input side as well */
         flux_watcher_stop (c->in_prep_w);

@chu11 I think you are the most familiar with this code. Does this analysis sound right?

Edit: I am referring to this code

@chu11
Member

chu11 commented Apr 29, 2024

@garlick yeah, that analysis looks right to me. I think we could also do the "in" check watcher as well.

I remember the cbuf_is_valid() from awhile back eating up a nice chunk of CPU

#2527

it's a big chunk of checks

cbuf_is_valid (cbuf_t cb)
and seems to take up the majority of the "has_line" check.

that's another thing we could tweak if we wanted to.

@garlick
Member

garlick commented Apr 29, 2024

remember the cbuf_is_valid() from awhile back eating up a nice chunk of CPU

Looks like cbuf compiles that out if NDEBUG is set, along with other checks. Maybe we should be building it with that set.

Did you want to propose a PR for all this?

@chu11
Member

chu11 commented Apr 29, 2024

sure lemme give it a shot

@chu11 chu11 self-assigned this Apr 29, 2024
@garlick
Member

garlick commented Apr 29, 2024

another thought: since the remote subprocess already buffers data before sending it, what are we gaining by doing it again on the local side? Could we just have the message callback directly invoke the output callback(s)?

@grondo
Contributor Author

grondo commented Apr 29, 2024

I retried with @garlick's patch above, which didn't seem to improve things. The broker is still spending a majority of its time in these three remote_out and remote_in prep/check watchers:

-   97.01%     0.00%  job-exec         libc-2.28.so           [.] __GI___clone
     __GI___clone (inlined)
     start_thread
     module_thread
     mod_main
     flux_reactor_run
     ev_run
   - ev_run
      - 87.27% ev_invoke_pending
         + 33.25% remote_out_prep_cb
         + 19.65% remote_in_check_cb
         + 16.81% remote_in_prep_cb
           11.28% prepare_cb
           1.51% check_cb
      + 9.72% queue_events (inlined)

And I'm fairly certain I recompiled, but let me check again after a make clean

@garlick
Member

garlick commented Apr 29, 2024

Ah, this seems to have gotten remote_out_check_cb out of the picture, which is progress.

I guess remote_out_prep_cb() is continually making the same checks even when nothing has been put into the buffer. Possibly, instead of doing the heavyweight checks there, we should be doing them at the point where the buffer is modified, so we are only checking after conditions have changed. Then at those points we could start the check watcher.
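
A rough sketch of that idea (the helper name here is hypothetical; the fields and the condition are the ones from the diff above). The test would run once, at the single place the read buffer is filled, instead of on every reactor iteration in the prep callback:

/* Hypothetical helper, called at the one place c->read_buffer is filled.
 * Start the check watcher only when there is something a caller could
 * actually consume, so an idle channel costs nothing per reactor loop. */
static void channel_output_buffer_updated (struct subprocess_channel *c)
{
    if ((c->line_buffered && fbuf_has_line (c->read_buffer))
        || (!c->line_buffered && fbuf_bytes (c->read_buffer) > 0)
        || (c->read_eof_received && !c->eof_sent_to_caller))
        flux_watcher_start (c->out_check_w);
}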

@chu11
Member

chu11 commented Apr 29, 2024

another thought: since the remote subprocess already buffers data before sending it, what are we gaining by doing it again on the local side? Could we just have the message callback directly invoke the output callback(s)?

Hmmmm. It seems that we don't pass LINE_BUFFERED to the "server side" subprocess, so the raw data coming back still needs to be buffered locally to handle line buffering. But perhaps there is an optimization somewhere in here.

@garlick
Member

garlick commented Apr 29, 2024

It seems that we don't pass LINE_BUFFERED to the "server side" subprocess,

It's part of the flux_cmd_t object (an option) so it should be getting passed through to the server side.

@chu11
Member

chu11 commented Apr 29, 2024

I guess remote_out_prep_cb() is continually making the same checks even when nothing has been put into the buffer. Possibly, instead of doing the heavyweight checks there, we should be doing them at the point where the buffer is modified, so we are only checking after conditions have changed. Then at those points we could start the check watcher.

I was just thinking something similar. fbuf_has_line() calls cbuf_peek_line(), and that's heavyweight b/c of A) cbuf_is_valid(), which is a bunch of asserts including some memcmp() calls, and B) the "is a line available" check, which is a pretty slow iteration looking for a \n.

@chu11
Member

chu11 commented Apr 29, 2024

The broker is still spending a majority of its time in these three remote_out and remote_in prep/check watchers:

I'm really curious whether things would be a lot better if cbuf_is_valid() were commented out. Looking at your perf output:

         - 17.73% remote_out_prep_cb
            - 14.68% fbuf_has_line
               - 13.93% cbuf_peek_line
                  + 12.53% cbuf_is_valid
                    0.91% __GI___pthread_mutex_lock (inlined)

it certainly looks like cbuf_is_valid() is taking up the vast majority of the time for remote_out_prep_cb(). Looking internally, cbuf_peek_line() calls cbuf_is_valid() twice.

         - 15.24% remote_in_check_cb
            + 5.03% cbuf_used
              0.73% flux_watcher_stop
         + 10.36% remote_in_prep_cb
           7.62% prepare_cb
           6.98% check_cb

I looked and cbuf_used() also calls cbuf_is_valid(). So I suspect that's a big chunk of the 5.03 percent in remote_in_check_cb() and probably a decent chunk of remote_in_prep_cb() too.

It certainly would be part of the problem.

Edit:

in remote_in_check_cb() I guess I wasn't doing myself any favors calling fbuf_bytes() twice for the most likely common path, thus calling cbuf_is_valid() twice a lot. Could whittle that down to one call.

    if (fbuf_bytes (c->write_buffer) > 0) {                                                                                                 
        if (remote_write (c) < 0)                                                                                                           
            goto error;                                                                                                                     
    }                                                                                                                                       
                                                                                                                                            
    if (!fbuf_bytes (c->write_buffer) && c->closed && !c->write_eof_sent) {                                                                 
        if (remote_close (c) < 0)                                                                                                           
            goto error;                                                                                                                     
        c->write_eof_sent = true;                                                                                                           
    }                                  

Edit2: I'm stupid .... remote_write() could change the number of bytes in the buffer, so it has to be called twice

@grondo
Contributor Author

grondo commented Apr 29, 2024

I can quickly try this since I have the test environment all set up already.

@garlick
Member

garlick commented Apr 29, 2024

Compiling cbuf with -DNDEBUG may be the way to go.

@grondo
Contributor Author

grondo commented Apr 29, 2024

That causes compilation errors:

$ make V=1
depbase=`echo cbuf.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;\
/bin/sh ../../../libtool  --tag=CC   --mode=compile gcc -DHAVE_CONFIG_H -I. -I../../../config  -DWITH_PTHREADS -DNDEBUG  -Wall -Werror -Wno-strict-aliasing -Wno-error=deprecated-declarations -Werror=missing-field-initializers  -Wno-parentheses -Wno-error=parentheses -g -O2 -MT cbuf.lo -MD -MP -MF $depbase.Tpo -c -o cbuf.lo cbuf.c &&\
mv -f $depbase.Tpo $depbase.Plo
libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../../../config -DWITH_PTHREADS -DNDEBUG -Wall -Werror -Wno-strict-aliasing -Wno-error=deprecated-declarations -Werror=missing-field-initializers -Wno-parentheses -Wno-error=parentheses -g -O2 -MT cbuf.lo -MD -MP -MF .deps/cbuf.Tpo -c cbuf.c  -fPIC -DPIC -o .libs/cbuf.o
cbuf.c: In function ‘cbuf_peek_line’:
cbuf.c:665:15: error: variable ‘l’ set but not used [-Werror=unused-but-set-variable]
  665 |     int n, m, l;
      |               ^
cbuf.c: In function ‘cbuf_read_line’:
cbuf.c:701:15: error: variable ‘l’ set but not used [-Werror=unused-but-set-variable]
  701 |     int n, m, l;
      |               ^
cbuf.c: In function ‘cbuf_replay_line’:
cbuf.c:738:15: error: variable ‘l’ set but not used [-Werror=unused-but-set-variable]
  738 |     int n, m, l;
      |               ^
cbuf.c: In function ‘cbuf_write_line’:
cbuf.c:812:23: error: variable ‘n’ set but not used [-Werror=unused-but-set-variable]
  812 |     int nfree, ncopy, n;
      |                       ^
cc1: all warnings being treated as errors
make: *** [Makefile:537: cbuf.lo] Error 1

But I'll see if I can fix that as well (I actually think we have this fixed in our patches for conda)

@grondo
Contributor Author

grondo commented Apr 29, 2024

Eh, I just ended up disabling those errors/warnings:

diff --git a/src/common/liblsd/Makefile.am b/src/common/liblsd/Makefile.am
index 3670604ae..009bf7aa6 100644
--- a/src/common/liblsd/Makefile.am
+++ b/src/common/liblsd/Makefile.am
@@ -1,13 +1,16 @@
 AM_CFLAGS = \
        $(WARNING_CFLAGS) \
        $(CODE_COVERAGE_CFLAGS) \
-       -Wno-parentheses -Wno-error=parentheses
+       -Wno-parentheses -Wno-error=parentheses \
+       -Wno-error=unused-but-set-variable \
+       -Wno-unused-but-set-variable
 
 AM_LDFLAGS = \
        $(CODE_COVERAGE_LIBS)
 
 AM_CPPFLAGS = \
-       -DWITH_PTHREADS
+       -DWITH_PTHREADS \
+       -DNDEBUG
 
 noinst_LTLIBRARIES = liblsd.la

@grondo
Contributor Author

grondo commented Apr 29, 2024

Ok, that removed cbuf_is_valid() from the perf report as expected, but I think this just allows the event loop to spin faster -- the job-exec thread is still taking 100% CPU:

-   95.76%     0.00%  job-exec         libc-2.28.so           [.] __GI___clone (inlined)                                                                                                                                                                                     
     __GI___clone (inlined)                                                                                                                                                                                                                                                  
     start_thread                                                                                                                                                                                                                                                            
     module_thread                                                                                                                                                                                                                                                           
     mod_main                                                                                                                                                                                                                                                                
     flux_reactor_run                                                                                                                                                                                                                                                        
     ev_run                                                                                                                                                                                                                                                                  
   - ev_run                                                                                                                                                                                                                                                                  
      - 79.70% ev_invoke_pending                                                                                                                                                                                                                                             
         - 22.65% remote_in_prep_cb                                                                                                                                                                                                                                          
            + 4.32% cbuf_used                                                                                                                                                                                                                                                
              1.88% fbuf_bytes                                                                                                                                                                                                                                               
         - 22.62% remote_in_check_cb                                                                                                                                                                                                                                         
            + 5.30% cbuf_used                                                                                                                                                                                                                                                
              0.97% fbuf_bytes                                                                                                                                                                                                                                               
              0.78% flux_watcher_stop                                                                                                                                                                                                                                        
         - 15.09% remote_out_prep_cb                                                                                                                                                                                                                                         
            - 6.37% fbuf_has_line                                                                                                                                                                                                                                            
               - 4.89% cbuf_peek_line                                                                                                                                                                                                                                        
                    3.10% __GI___pthread_mutex_lock (inlined)                                                                                                                                                                                                                
                    1.13% __pthread_mutex_unlock_usercnt                                                                                                                                                                                                                     
           10.87% prepare_cb                                                                                                                                                                                                                                                 
           2.60% check_cb                                                                                                                                                                                                                                                    
      - 16.04% queue_events (inlined)                                                                                                                                                                                                                                        
           15.84% ev_feed_event

@chu11
Member

chu11 commented Apr 29, 2024

Hmm, it seems like the output prep check watcher is always active, so while the output channel is open, every event loop will check the buffer for data. The prep watcher should be stopped when the buffer is empty and started only when it contains consumable data.

disabling output prep/check was pretty easy and passed all tests, although I wonder if it works as a "fallout" of other behavior rather than because of a correct implementation (more below).

the "input" side ("input" being writing to the subprocess) was a different story.

it's obviously been a long time since I wrote the flux buffers, but I think the main issue is that the write buffer is filled "outside of the reactor" in flux_subprocess_write(). So when data is in the buffer and ev_io_start() is called under the covers, if watchers haven't been started yet on the buffer, I don't know if ev_io_start() does anything. And that can lead to issues.

I think the output side happens to work b/c the read buffer is filled within a check callback, so the reactor is already humming along and the next "prep" cycle can detect the buffer not being empty.
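
A rough sketch of what handling that might look like (the helper is hypothetical; the fields are the ones used by the "in" watchers above):

/* Hypothetical helper, called from flux_subprocess_write() right after data
 * is appended to c->write_buffer.  Because the write happens outside the
 * reactor's prep/check cycle, the watchers have to be (re)started here
 * explicitly, or a previously idle channel would never notice the new data. */
static void channel_input_buffer_updated (struct subprocess_channel *c)
{
    if (fbuf_bytes (c->write_buffer) > 0) {
        flux_watcher_start (c->in_prep_w);
        flux_watcher_start (c->in_check_w);
    }
}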

@chu11
Member

chu11 commented Apr 29, 2024

Ok, that removed cbuf_is_valid() from the perf report as expected, but I think this just allows the event loop to spin faster -- the job-exec thread is still taking 100% CPU:

Hmmmm. Interesting. Nothing strange in the logs? I'm not sure what the job is that you're running, but it makes me think that the buffers are not emptying, thus spinning.

@grondo
Contributor Author

grondo commented Apr 29, 2024

The workload is multiple 128N/128p jobs on a size=1536 instance with a flat topology. When getting up to about 64 simultaneous jobs I started seeing this behavior. (That's 64*128 = 8192 active job shells all launched from rank 0)

@chu11
Member

chu11 commented Apr 29, 2024

per offline chat with @garlick, the general belief is that something other than "libsubprocess is slow" is causing the primary performance issue (a buffer not emptying, something causing the reactor to constantly exit the idle phase, etc.).

that said, we certainly want to speed up things in general

  1. deal with cbuf_is_valid() slowness, we'll leave that to issue cbuf: performance improvements #2527
  2. speed up libsubprocess in general
    • general ability to limit the number of calls to all of the busy cbuf functions
    • perhaps no buffering on remote output, call output callback directly?
  3. other ideas
    • I dislike the fact that cbuf_peek_line() does a loop and checks for a newline on every line of output; could this be sped up in some way? Hypothetically, a `fbuf_write_line()` function could just flag that a line is in the buffer (see the sketch below).
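
Here is a standalone toy of that last idea (not the real fbuf; the names and layout are invented just to show the principle of counting newlines at write time so "is a line available?" never has to scan the buffer):

#include <stdbool.h>
#include <string.h>

/* Toy buffer that counts complete lines as they are written. */
struct line_counted_buf {
    char data[4096];
    size_t used;
    int lines;                          /* unread '\n'-terminated lines */
};

static int buf_write (struct line_counted_buf *b, const char *s, size_t len)
{
    if (len > sizeof (b->data) - b->used)
        return -1;                      /* no room */
    memcpy (b->data + b->used, s, len);
    b->used += len;
    for (size_t i = 0; i < len; i++)    /* only scan the bytes just added */
        if (s[i] == '\n')
            b->lines++;
    return 0;
}

static bool buf_has_line (const struct line_counted_buf *b)
{
    return b->lines > 0;                /* O(1): no scan, no asserts, no locks */
}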

@grondo
Contributor Author

grondo commented Apr 29, 2024

deal with cbuf_is_valid() slowness, we'll leave that to issue #2527

FYI - the solution proposed by @garlick (patch in this comment) worked well.

@garlick
Member

garlick commented May 1, 2024

@chu11, FYI I added this issue to our milestone for next week's release. LMK if that's not feasible. Seems like at least we could get the performance improvements in.

@chu11
Member

chu11 commented May 1, 2024

I think it should be doable; my analysis a few days ago suggested we can remove the additional buffering and call the output callback directly.

@chu11
Member

chu11 commented May 2, 2024

another thought: since the remote subprocess already buffers data before sending it, what are we gaining by doing it again on the local side? Could we just have the message callback directly invoke the output callback(s)?

Doh! Once I started looking at making actual changes, it became obvious. We buffer b/c it's part of the API. The output callback is not passed the output data & length. The user calls flux_subprocess_read() and its variants in the callback and can read as much or as little as they care for. So it needs to be buffered somewhere.
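
For reference, a typical consumer looks something like this (illustrative only; the exact flux_subprocess_read() arguments are from memory, so double-check against subprocess.h):

#include <stdio.h>

/* Output callback: no data or length is passed in, so the library has to
 * keep the data buffered until the caller pulls it out here. */
static void output_cb (flux_subprocess_t *p, const char *stream)
{
    int len;
    const char *data = flux_subprocess_read (p, stream, -1, &len);

    if (data && len > 0)
        fwrite (data, 1, len, stdout);  /* read as much or as little as desired */
}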

So I don't think this is going to be an optimization option.

@garlick
Member

garlick commented May 2, 2024

I don't think this is going to be an optimization option.

Darn, I didn't notice that either.

Well there are still the other optimizations and the mystery of the reactor loop that wouldn't stop.

@grondo
Contributor Author

grondo commented May 3, 2024

Could we add a new function to disable the existing callbacks and register a new one that is passed output as soon as it is ready, including the data and length? It is kind of ugly but could be used to work around this performance issue in the job-exec module.
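
Hypothetically, the new callback could look something like this (name and prototype invented here purely for illustration):

/* Hypothetical "unbuffered" output callback: the data and its length are
 * handed to the caller directly, so libsubprocess never has to copy the
 * rexec response payload into a local fbuf first. */
typedef void (*flux_subprocess_output_raw_f) (flux_subprocess_t *p,
                                              const char *stream,
                                              const char *data,
                                              int len);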

@garlick
Member

garlick commented May 3, 2024

Could we add a new function to disable the existing callbacks and register a new one that is passed output as soon as it is ready, including the data and length?

Another way to do it would be to add a flag that could be passed to flux_rexec() like FLUX_SUBPROCESS_FLAGS_LOCAL_UNBUF that selects this behavior at the onset so those buffers don't have to be created then destroyed.

@grondo
Contributor Author

grondo commented May 3, 2024

That's a good idea too, though we'd still need a different callback prototype than the current one:

typedef void (*flux_subprocess_output_f) (flux_subprocess_t *p,
                                          const char *stream);

The original subprocess API was modeled after some existing event-driven interfaces that just pass back an event and the callback handles the event. A bit of an unfortunate choice in hindsight.

@garlick
Member

garlick commented May 3, 2024

That reminds me - I have a subprocess cleanup branch in which I consolidated the stdout, stderr, and channel output callback ops into one (since the stream name is passed into the callback). Maybe that would be a good thing to do before adding another output type, since otherwise it would require adding three new operations.

@chu11
Member

chu11 commented May 3, 2024

ok, I see three paths (some comments were in #5932)

  1. simple "don't start prep/check unless necessary"

    • output side - not too bad
    • input side - tricky stuff, will take some refactoring work
    • likelihood of meeting the May 7th release: decent chance
  2. use an fbuf watcher instead, which can handle some of the prep/check start/stop trickiness when the buffer is empty/non-empty

    • output side - assumed this can be done, will probably have to handle some tricky stuff
    • input side - dunno how hard it will be to handle the tricky parts, could just skip this half
    • likelihood of meeting the May 7th release: I'm not confident, but possible
  3. remote_output - call the remote output callback directly instead of using buffers, will require re-work of the output callback function and all callers

    • likelihood of meeting the May 7th release: can't imagine this happening

personally, I think option 3 is the best path (the devil is in the details until I start).

@grondo
Contributor Author

grondo commented May 3, 2024

I don't see an option for enabling an alternate output callback function as suggested above, either with or without a flag to disable creation of the buffers in the first place. This would allow updating only the job-exec module, where we are experiencing the issue.

@chu11
Member

chu11 commented May 3, 2024

I don't see an option for enabling an alternate output callback function as suggested above, either with or without a flag to disable creation of the buffers in the first place

yeah, we would either A) add another callback option along with the flag, or B) change the output function prototype, with the flag indicating whether data is buffered or passed via the callback.

@garlick
Member

garlick commented May 3, 2024

Another idea on that one that might simplify things? Could we keep all the function prototypes the same (see the sketch after this list) but:

  • add the FLUX_SUBPROCESS_FLAGS_LOCAL_UNBUF flag
  • make a rule that if you use that flag, you can only call flux_subprocess_read() from the output callbacks
  • change it so it reads directly from the response payload
  • if you call the other functions you get EINVAL
  • if you don't consume all the data in the callback, it's gone
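
A sketch of a caller following those rules (the flag name is the one proposed above, and the flux_rexec()/ops call pattern is assumed from the existing API, so treat this as pseudocode for the proposal rather than a final interface):

#include <stdio.h>

/* Under the proposed LOCAL_UNBUF rules: only flux_subprocess_read() may be
 * called here, it reads straight from the response payload, and anything
 * not consumed is gone once the callback returns. */
static void output_cb (flux_subprocess_t *p, const char *stream)
{
    int len;
    const char *data = flux_subprocess_read (p, stream, -1, &len);
    if (data && len > 0)
        fwrite (data, 1, len, stderr);
}

static flux_subprocess_t *start_unbuffered (flux_t *h, int rank, flux_cmd_t *cmd)
{
    static flux_subprocess_ops_t ops = {
        .on_stdout = output_cb,
        .on_stderr = output_cb,
    };
    return flux_rexec (h, rank, FLUX_SUBPROCESS_FLAGS_LOCAL_UNBUF, cmd, &ops);
}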

@grondo
Contributor Author

grondo commented May 3, 2024

I like that!

@chu11
Member

chu11 commented May 4, 2024

Hmmm, hit a minor non-optimal thing, per the flux_subprocess_read() documentation:

/*                                                                                                                                          
 *  Read up to `len` bytes of unread data from stream `stream`.  To                                                                         
 *   read all data, specify 'len' of -1.  'stream' can be "stdout",                                                                         
 *   "stderr", or the name of a stream specified with flux_cmd_add_channel().                                                               
 *                                                                                                                                          
 *   Returns pointer to buffer on success and NULL on error with errno                                                                      
 *   set.  Buffer is guaranteed to be NUL terminated.  User shall not                                                                       
 *   free returned pointer.  Length of buffer returned can optionally                                                                       
 *   returned in 'lenp'.  A length of 0 indicates that the subprocess                                                                       
 *   has closed this stream.                                                                                                                
 */ 

"Buffer is guaranteed to be NUL terminated. " means we can't just send over the data from the response payload. We need to copy it into a buffer and add a NUL at the end. This is what is done internally within an fbuf.

I pondered whether we should create an alternate new function to return the data from the message directly without NUL termination, but given how common it is to simply:

ptr = flux_subprocess_read(...);
printf("%s", ptr);

that's probably not a good idea.

I hate to carry around the buffer and memcpy() into it, but it seems like the right thing to do.
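
A minimal sketch of that copy step (buffer management simplified; in the real code this would live with the per-channel state rather than being allocated on every read):

#include <stdlib.h>
#include <string.h>

/* Copy 'len' bytes and NUL terminate, preserving the flux_subprocess_read()
 * guarantee that the returned buffer is a valid string. */
static char *copy_with_nul (const char *data, int len)
{
    char *buf = malloc (len + 1);
    if (!buf)
        return NULL;
    memcpy (buf, data, len);
    buf[len] = '\0';
    return buf;
}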

@chu11
Member

chu11 commented May 9, 2024

Question: should we revert PR #5932 (the short-term temporary solution) when we add FLUX_SUBPROCESS_FLAGS_LOCAL_UNBUF support?

There's the part of me that thinks we should revert it b/c FLUX_SUBPROCESS_FLAGS_LOCAL_UNBUF was supposed to be the "real solution".

On the other hand, what was done in #5932 is fine ...

Was just going back and forth on it.

@garlick
Member

garlick commented May 9, 2024

You mean to reduce complexity? It seems like it will still be useful for non-LOCAL_UNBUF users.

@chu11
Member

chu11 commented May 15, 2024

I had more or less completed local unbuf support when I noticed the following in job-exec.

void jobinfo_log_output (struct jobinfo *job,                                                                                                                
                         int rank,                                                                                                                           
                         const char *component,                                                                                                              
                         const char *stream,                                                                                                                 
                         const char *data,                                                                                                                   
                         int len)                                                                                                                            
{                                                                                                                                                            
    char buf[16];                                                                                                                                            
    if (len == 0 || !data || !stream)                                                                                                                        
        return;                                                                                                                                              
    if (snprintf (buf, sizeof (buf), "%d", rank) >= sizeof (buf))                                                                                            
        flux_log_error (job->h, "jobinfo_log_output: snprintf");                                                                                             
    if (eventlogger_append_pack (job->ev, 0,                                                                                                                 
                                 "exec.eventlog",                                                                                                            
                                 "log",                                                                                                                      
                                 "{ s:s, s:s s:s s:s# }",                                                                                                    
                                 "component", component,                                                                                                     
                                 "stream", stream,                                                                                                           
                                 "rank", buf,                                                                                                                
                                 "data", data, len) < 0)                                                                                                     
        flux_log_error (job->h,                                                                                                                              
                        "eventlog_append failed: %s: message=%s",                                                                                            
                        idf58 (job->id),                                                                                                                     
                        data);                                                                                                                               
}                                                                                                                                                            

The data from the remote process is written out with an explicit length, so the NUL termination byte isn't even needed. Any copying we do just to add the NUL byte is unnecessary.

Soooo .... alternate new idea, the new LOCAL_UNBUF flag returns the data & length directly without NUL termination. This removes the need to create an internal buffer copy to add NUL termination. We create a new function flux_subprocess_read_local_unbuf() (or something like that) that will return the buffer and length without NUL termination just for this flag.
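
Hypothetically, its prototype might look like the following (name per the paragraph above, "or something like that"; nothing here is final):

/* Hypothetical: return a pointer into the response payload plus its length,
 * with no NUL-termination guarantee and no intermediate copy.  Only valid
 * for subprocesses created with the LOCAL_UNBUF flag. */
const char *flux_subprocess_read_local_unbuf (flux_subprocess_t *p,
                                              const char *stream,
                                              int *lenp);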

Edit: Technically, b/c the data coming from the server is passed as a string, I believe the data should be guaranteed to be NUL terminated all the time regardless. But we should probably not assume that in general.

@chu11
Member

chu11 commented May 18, 2024

Posting the results of some profiling on corona

flux alloc -N32 --time-limit=4h
# now in allocation
flux submit -N32 -n1536 flux start sleep inf
flux proxy JOBID flux resource list
flux proxy JOBID flux getattr broker.pid
# i did perf record in another window
perf record -g --call-graph=dwarf -o /tmp/achu/perf.data -p PID
flux proxy JOBID flux submit --cc=1-100 -N128 -n128 sleep 60
# i ran perf report in another window
perf report --input=/tmp/achu/perf.data --pid=PID

w/ flux-core 0.61.2, results seem similar to original

+   94.60%     0.31%  flux-broker-0  libflux-core.so.2.0.0          [.] ev_run
+   94.60%     0.00%  flux-broker-0  libflux-core.so.2.0.0          [.] flux_reactor_run
+   88.76%     0.00%  flux-broker-0  libc-2.28.so                   [.] __GI___clone (inlined)
+   88.76%     0.00%  flux-broker-0  flux-broker                    [.] module_thread
+   88.74%     0.00%  flux-broker-0  libpthread-2.28.so             [.] start_thread
+   84.40%     3.33%  flux-broker-0  libflux-core.so.2.0.0          [.] ev_invoke_pending
+   83.68%     0.00%  flux-broker-0  job-exec.so                    [.] mod_main
+   24.89%    19.60%  flux-broker-0  job-exec.so                    [.] cbuf_is_valid
+   24.55%     3.54%  flux-broker-0  job-exec.so                    [.] fbuf_has_line
+   21.41%     4.78%  flux-broker-0  job-exec.so                    [.] remote_out_check_cb
+   20.98%     0.55%  flux-broker-0  job-exec.so                    [.] cbuf_peek_line
+   14.76%     3.03%  flux-broker-0  job-exec.so                    [.] remote_out_prep_cb
+   12.56%     5.91%  flux-broker-0  job-exec.so                    [.] remote_in_check_cb
+   11.72%     0.45%  flux-broker-0  job-exec.so                    [.] cbuf_used
+    9.30%     3.92%  flux-broker-0  job-exec.so                    [.] remote_in_prep_cb
+    8.67%     0.00%  flux-broker-0  libflux-core.so.2.0.0          [.] queue_events (inlined)
+    8.48%     8.41%  flux-broker-0  libflux-core.so.2.0.0          [.] ev_feed_event
+    6.39%     6.32%  flux-broker-0  libflux-core.so.2.0.0          [.] check_cb
+    5.81%     0.00%  flux-broker-0  job-exec.so                    [.] cbuf_mutex_is_locked (inlined)

w/ current master, which no longer compiles in cbuf_is_valid() and has the prep/check optimization. As expected, the output prep/check is no longer there and cbuf_is_valid() is no longer there; a good chunk less time is spent in job-exec.

+   68.19%     0.00%  job-exec         libc-2.28.so           [.] __GI___clone (inlined)
+   68.19%     0.00%  job-exec         libpthread-2.28.so     [.] start_thread
+   68.19%     0.00%  job-exec         lt-flux-broker         [.] module_thread
+   68.19%     0.00%  job-exec         job-exec.so            [.] mod_main
+   68.19%     0.00%  job-exec         libflux-core.so.2.0.0  [.] flux_reactor_run
+   68.19%     0.69%  job-exec         libflux-core.so.2.0.0  [.] ev_run
+   51.73%     1.25%  job-exec         libflux-core.so.2.0.0  [.] ev_invoke_pending
+   24.50%     9.50%  job-exec         job-exec.so            [.] remote_in_check_cb
+   19.30%    19.19%  job-exec         job-exec.so            [.] cbuf_used
+   16.62%     5.43%  job-exec         job-exec.so            [.] remote_in_prep_cb
+   16.32%     0.00%  job-exec         libflux-core.so.2.0.0  [.] queue_events (inlined)
+   15.71%    15.61%  job-exec         libflux-core.so.2.0.0  [.] ev_feed_event

w/ the unbuf code

+   59.20%     0.00%  job-exec         libc-2.28.so           [.] __GI___clone (inlined)
+   59.20%     0.00%  job-exec         libpthread-2.28.so     [.] start_thread
+   59.20%     0.00%  job-exec         lt-flux-broker         [.] module_thread
+   59.20%     0.00%  job-exec         job-exec.so            [.] mod_main
+   59.20%     0.00%  job-exec         libflux-core.so.2.0.0  [.] flux_reactor_run
+   59.20%     0.55%  job-exec         libflux-core.so.2.0.0  [.] ev_run
+   45.40%     1.04%  job-exec         libflux-core.so.2.0.0  [.] ev_invoke_pending
+   20.58%     7.64%  job-exec         job-exec.so            [.] remote_in_check_cb
+   17.08%     0.00%  flux-broker-0    lt-flux-broker         [.] main
+   17.08%     0.00%  flux-broker-0    libflux-core.so.2.0.0  [.] flux_reactor_run
+   17.08%     0.21%  flux-broker-0    libflux-core.so.2.0.0  [.] ev_run
+   16.94%     0.00%  flux-broker-0    lt-flux-broker         [.] _start
+   16.94%     0.00%  flux-broker-0    libc-2.28.so           [.] __libc_start_main
+   16.21%     0.00%  ZMQbg/IO/0       libc-2.28.so           [.] __GI___clone (inlined)
+   16.21%     0.00%  ZMQbg/IO/0       libpthread-2.28.so     [.] start_thread
+   16.21%     0.00%  ZMQbg/IO/0       libzmq.so.5.2.4        [.] 0x0000155554a32bee
+   15.96%     0.35%  flux-broker-0    libflux-core.so.2.0.0  [.] ev_invoke_pending
+   15.82%    15.61%  job-exec         job-exec.so            [.] cbuf_used
+   14.94%     0.00%  ZMQbg/IO/0       libzmq.so.5.2.4        [.] 0x00001555549f1b8b
+   13.72%     4.42%  job-exec         job-exec.so            [.] remote_in_prep_cb

Seems like a decent incremental improvement, as the amount of time in job-exec went down and the amount of time spent in cbuf_used() went down too. Not as much gain as from the prep/check work, but that is probably not too surprising, as the prep/check and cbuf_is_valid() optimizations were much more impactful relative to the original.

But optimizing the input prep/check would be the next logical step ... although that one is trickier, and we have to decide whether the improvement there is worthwhile given the trickiness.

Edit: interesting, cbuf_used() is almost nothing ..

int                                                                                                                                         
cbuf_used (cbuf_t cb)                                                                                                                       
{                                                                                                                                           
    int used;                                                                                                                               
                                                                                                                                            
    assert (cb != NULL);                                                                                                                    
    cbuf_mutex_lock (cb);                                                                                                                   
    assert (cbuf_is_valid (cb));                                                                                                            
    used = cb->used;                                                                                                                        
    cbuf_mutex_unlock (cb);                                                                                                                 
    return (used);                                                                                                                          
}  

(AFAICT we do not compile with pthreads support ... and cbuf_is_valid() is compiled out of this version)

So perhaps the cbuf_used() time is just from the remote_in prep/check callbacks, b/c those call fbuf_bytes() a ton (which calls cbuf_used()).

@chu11
Member

chu11 commented May 20, 2024

I tried to test my branch that reduces the input prep/check side of things but hit what I presume is #5920.

2.649s: flux-shell[30]: ERROR: plugin 'oom': shell.init failed
2.649s: flux-shell[30]: FATAL: shell_init
2.648s: flux-shell[31]: ERROR: plugin 'oom': shell.init failed
2.648s: flux-shell[31]: FATAL: shell_init

On the one hand, I'm hopeful that means that the input prep/check side of things would actually make this reproducer "pass" and job-exec can make forward progress on jobs.

On the other hand, it's not entirely clear to me why this appears with my change. My only assumption is that job-exec is actually making forward progress and job shells are being launched for real? Whereas before, job-exec was stuck and things weren't actually getting launched?? Not sure.

@grondo
Contributor Author

grondo commented May 20, 2024

I was hitting that error in my testing as well, FWIW.

@chu11
Member

chu11 commented May 21, 2024

I temporarily commented out the shell oom watcher to avoid inotify, and the reproducer actually runs all sleep jobs to completion!

+   27.70%     0.00%  flux-broker-0    lt-flux-broker                                 [.] main
+   27.70%     0.00%  flux-broker-0    libflux-core.so.2.0.0                          [.] flux_reactor_run
+   27.70%     0.24%  flux-broker-0    libflux-core.so.2.0.0                          [.] ev_run
+   26.65%     0.00%  ZMQbg/IO/0       libc-2.28.so                                   [.] __GI___clone (inlined)
+   26.65%     0.00%  ZMQbg/IO/0       libpthread-2.28.so                             [.] start_thread
+   26.65%     0.00%  ZMQbg/IO/0       libzmq.so.5.2.4                                [.] 0x0000155554a32bee
+   26.60%     0.00%  flux-broker-0    lt-flux-broker                                 [.] _start
+   26.60%     0.00%  flux-broker-0    libc-2.28.so                                   [.] __libc_start_main
+   25.55%     0.46%  flux-broker-0    libflux-core.so.2.0.0                          [.] ev_invoke_pending
+   23.75%     0.00%  job-exec         libc-2.28.so                                   [.] __GI___clone (inlined)
+   23.75%     0.00%  job-exec         libpthread-2.28.so                             [.] start_thread
+   23.75%     0.00%  job-exec         lt-flux-broker                                 [.] module_thread
+   23.75%     0.00%  job-exec         job-exec.so                                    [.] mod_main
+   23.75%     0.00%  job-exec         libflux-core.so.2.0.0                          [.] flux_reactor_run
+   23.75%     0.16%  job-exec         libflux-core.so.2.0.0                          [.] ev_run
+   22.76%     0.00%  ZMQbg/IO/0       libzmq.so.5.2.4                                [.] 0x00001555549f1b8b
+   22.49%     0.07%  job-exec         libflux-core.so.2.0.0                          [.] ev_invoke_pending
+   13.56%     0.05%  flux-broker-0    lt-flux-broker                                 [.] check_cb
+   13.11%     0.00%  ZMQbg/IO/0       libzmq.so.5.2.4                                [.] 0x00001555549f601b
+   13.06%     0.00%  ZMQbg/IO/0       libzmq.so.5.2.4                                [.] 0x0000155554a1c84e
+   13.01%     0.00%  ZMQbg/IO/0       [unknown]                                      [.] 0xffffffffb74000e9
+   12.46%     0.01%  ZMQbg/IO/0       [unknown]                                      [.] 0xffffffffb68053eb
+   11.89%     0.01%  job-exec         job-exec.so                                    [.] check_cb
+   11.88%     0.00%  job-exec         job-exec.so                                    [.] exec_start_cmds (inlined)
+   11.88%     0.00%  job-exec         job-exec.so                                    [.] exec_start_cmd (inlined)
+   11.83%     0.00%  job-exec         job-exec.so                                    [.] flux_rexec_ex
+   10.84%     0.04%  flux-broker-0    lt-flux-broker                                 [.] child_cb

So maybe all those gajillions of prep/checks were just too much for job-exec to handle? Will put up a PR in the near future.

@grondo
Contributor Author

grondo commented May 21, 2024

Great! Can you expand the stack trace for job-exec above just so we see what's now taking the most time? Thanks!

@chu11
Member

chu11 commented May 21, 2024

Great! Can you expand the stack trace for job-exec above just so we see what's now taking the most time? Thanks!

Oops, I didn't save that data off. I'll try again another time. From what I remember, there wasn't much under mod_main(), maybe just ev_run() or something.

@chu11
Member

chu11 commented May 22, 2024

Just profiled my "reduce prep/check of input watcher" branch and expanded out the calls.

-   23.94%     0.00%  job-exec         job-exec.so                                       [.] mod_main
     mod_main
     flux_reactor_run
     ev_run
   - ev_run
      - 22.85% ev_invoke_pending
         - 11.84% check_cb
            - 11.84% exec_start_cmds (inlined)
               - 11.83% exec_start_cmd (inlined)
                  - 11.79% flux_rexec_ex
                     - flux_rexec_ex
                        - 8.33% remote_exec
                           - 8.27% subprocess_rexec
                              - 4.47% flux_rpc_pack
                                 - flux_rpc_vpack
                                    - 4.17% flux_msg_vpack
                                       - 4.06% json_dumps
                                          + 3.66% json_dump_callback
                              - 3.78% rexec_ctx_create (inlined)
                                 - 3.77% cmd_tojson
                                    - 3.59% envz_tojson
                                       + 1.31% json_object_setn_new_nocheck
                                       + 0.65% 0x1555545b44bc
                                       + 0.56% 0x1555545b448e
                        - 2.73% subprocess_remote_setup
                           - 2.72% remote_setup_stdio (inlined)
                              - remote_channel_setup
                                 - 2.40% fbuf_create
                                    + 1.21% __GI___libc_malloc (inlined)
                                    + 1.16% cbuf_create
                        - 0.66% subprocess_create
                           - 0.57% __socketpair (inlined)
                              + 0.53% 0xffffffff8aa000e9
         - 2.47% rexec_continuation
            - 1.80% subprocess_rexec_get
               + 1.20% flux_rpc_get_unpack
         - 2.21% namespace_move_cb
            - jobinfo_release
               - 2.20% jobinfo_decref
                  - jobinfo_decref
                     - 2.17% exec_exit
                          bulk_exec_destroy
                        - bulk_exec_destroy
                           - 2.04% fzlist_destroy
                              - fzlist_purge
                                 + 2.01% subprocess_free (inlined)
         - 2.07% handle_cb
            + 1.75% dispatch_message (inlined)
         - 1.24% exec_exit_notify
            + 0.79% idset_range_clear
         - 0.73% remote_in_check_cb
            + 0.71% remote_write (inlined)
         + 0.67% check_cb
           0.52% completed_check_cb
      + 0.85% epoll_poll

No super obvious bottlenecks. We could certainly try to optimize a few of these to try to incrementally get things to perform better.

Edit: It's worth noting that because all the jobs completed successfully and the input & output prep/check was not started unless needed, it stands to reason that flux-broker didn't actually do much. So some of these medium-high percentages may not mean much.

Edit2: I should try again AFTER the jobs have started, to see if there is anything "flying around" in job-exec, because with a long sleep job, job-exec should have pretty much no activity.

@chu11
Member

chu11 commented May 24, 2024

I ran another profiling run w/ the reduced input prep/check (#6002). However, unlike above, I ran the profiling AFTER all of the jobs had started. This was an attempt to see if anything was regularly waking up job-exec and leading to all of those unnecessary prep/check calls. In the first run, job-exec literally didn't show up in the profile at all.

So I tried again, but increased the profiling frequency to the max. Still nothing from job-exec.

Sooo ... it's possible there's something rare that tickles job-exec that I am just missing via perf-record. We'd have to go through some alternate profiling mechanism to find it I guess.

Or it makes me wonder if the input/output prep/check in libsubprocess was simply the problem, although multiple code reviews from everyone make me believe the code is fine. And we know that at lower job-shell counts, things appear to be ok.

Or perhaps the sheer quantity of job-shells and prep/check going on has some side effect that we're not really understanding.

@grondo
Contributor Author

grondo commented Jun 3, 2024

Should this issue be closed now? I'm unclear if anything else is required here.

@chu11
Member

chu11 commented Jun 3, 2024

yeah, I think we can close this. The collection of fixes (mostly #5932, #5975, and #6002, although there were many ancillary ones around those) I believe solves this.

@chu11 chu11 closed this as completed Jun 3, 2024