Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ensure sidecar response is null terminated #1569

Open
ngwese opened this issue Apr 25, 2022 · 9 comments
Open

ensure sidecar response is null terminated #1569

ngwese opened this issue Apr 25, 2022 · 9 comments
Assignees
Labels

Comments

@ngwese
Copy link
Member

ngwese commented Apr 25, 2022

the norns-converged branch exhibits stability issues around script selection. sometimes the menu hangs, other times it shows garbage at the end. looking at the code my current guess is the buffers returned from the sidecar are not null terminated

memcpy(*result, recv_buf, recv_sz);

@ngwese ngwese added the bug label Apr 25, 2022
@ngwese ngwese self-assigned this Apr 25, 2022
@ngwese
Copy link
Member Author

ngwese commented Apr 25, 2022

will consider addressing this at the server end since the reallocate shrinks the effective buffer without adjusting the termination

buf = (char *)realloc(buf, nb);

@catfact
Copy link
Collaborator

catfact commented Apr 30, 2022

perhaps enough to make it buf = (char *)realloc(buf, nb+1); (adding extra byte for null)

@ngwese
Copy link
Member Author

ngwese commented Apr 30, 2022

i've been poking at this and i believe that there is more than one thing going on. in the case that we read less than CMD_CAPTURE_BYTES - 1 (which seems like the most common case) then the last byte of the the command output is replaced with \0 which stomps on that byte.

norns/norns/sidecar.cpp

Lines 38 to 39 in 421ca6e

size_t nb = fread(buf, 1, CMD_CAPTURE_BYTES - 1, f);
buf[nb] = '\0';

since we never allow the full size of CMD_CAPTURE_BYTES to be read it is safe to nb++ and then terminate. unfortunately that didn't fix the problem.

digging deeper stability improved when i disabled the background stat collection. looking at the code that makes sense. the stat.cc code is creates its own thread and then calls sidecar_client_cmd in a loop. invocation of norns.system_cmd in turn is also calling sidecar_client_cmd but from the lua thread so we have potentially overlapping requests to run commands. since every call to sidecar_client_cmd is doing a synchronous send/recv using the same nanomsg/nng socket i'm not sure that is anything which guarantees that the response coming back from the sidecar is getting matched with the correct requester thread.

looking at the nng docs on req sockets: https://nng.nanomsg.org/man/v1.3.2/nng_req.7.html

Furthermore, only a single receive operation may be pending at a time. Attempts to post more receive operations concurrently will result in NNG_ESTATE.

Requests may be canceled by sending a different request. This will cause the requester to discard any reply from the earlier request, but it will not stop a replier from processing a request it has already received or terminate a request that has already been placed on the wire.

this all leads me to believe the some of all of the following would be worth considering:

  • replace the lua level async command/capture queue and one shot thread created in system_cmd.cc with a dedicated thread consuming consuming from another nng inproc socket and servicing requests to the sidecar serially. this would allow any thread to run system_cmd safely.
  • or change the sidecar client/server such that each thread had its own client connection to the sidecar and change the server to send responses back to the socket where the request originated.
  • maintain a simple client as it is now and move the stat.cc stuff to the sidecar and use an entirely separate socket pair to relay stat information back.

@ngwese
Copy link
Member Author

ngwese commented Apr 30, 2022

actually, it looks like creating different nng "contexts" will allow concurrent use of the socket w/o getting things mixed up.
https://nng.nanomsg.org/man/v1.3.2/nng_ctx_open.3.html

@catfact
Copy link
Collaborator

catfact commented May 1, 2022

awesome, makes sense and thanks for digging into it

replace the lua level async command/capture queue and one shot thread created in system_cmd.cc with a dedicated thread consuming consuming from another nng inproc socket and servicing requests to the sidecar serially. this would allow any thread to run system_cmd safely.

this "feels" like it would be a good thing to do anyway, we don't really need that thread now that there is a separate async execution context available

the last byte of the the command output is replaced with \0 which stomps on that byte

i haven't eaten dinner and might be dense, but wouldn't buf[nb] actually be the element after the last received byte, since we're not writing lua at the moment?

@ngwese
Copy link
Member Author

ngwese commented May 1, 2022

not dense. you are correct - i think my brain was wanting it to be a simple off by one error.

@ngwese
Copy link
Member Author

ngwese commented Jun 11, 2022

sigh. after much reworking of code, moving all the sync and async command invocation into C, and ensuring the interaction with the sidecar process is serialized - the stability problem in the selection menu persists.

@ngwese
Copy link
Member Author

ngwese commented Jun 11, 2022

argh. it was an off by one error which i failed to spot:

norns/norns/sidecar.cpp

Lines 38 to 42 in 421ca6e

size_t nb = fread(buf, 1, CMD_CAPTURE_BYTES - 1, f);
buf[nb] = '\0';
buf = (char *)realloc(buf, nb);
*result = buf;
*sz = nb;

we read nb bytes from the pipe and get back 40 (for example) that means bytes 0-39 are valid output. we then null terminate the string with buf[nb] = '\0'; so byte 40 now is part of the valid output as well.... but then we return nb as the size of the captured output when it should be nb + 1 to account for the null character just added.

on the receiving side (back in matron) the nng message was ultimately loosing the null termination. code which at some point called strlen on the received capture; such as code which pushed the output into lua via lua_pushstring(...), would occasionally pick up garbage.

@catfact
Copy link
Collaborator

catfact commented Jun 11, 2022

derp sorry

ngwese added a commit to ngwese/norns that referenced this issue Jun 11, 2022
- capture buffer termination (issue monome#1569)
- moves async command capture logic from lua to c
- ensures sidecar requests from _norns.execute and _norns.system_cmd
  are not interleaved when called from different threads
- adds progressive capture buffer allocation in server up to
  maximum of 10MB
ngwese added a commit to ngwese/norns that referenced this issue Jun 12, 2022
- capture buffer termination (issue monome#1569)
- moves async command capture logic from lua to c
- ensures sidecar requests from _norns.execute and _norns.system_cmd
  are not interleaved when called from different threads
- adds progressive capture buffer allocation in server up to
  maximum of 10MB
ngwese added a commit that referenced this issue Jun 14, 2022
- capture buffer termination (issue #1569)
- moves async command capture logic from lua to c
- ensures sidecar requests from _norns.execute and _norns.system_cmd
  are not interleaved when called from different threads
- adds progressive capture buffer allocation in server up to
  maximum of 10MB
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants