
IPC issues #1993

Open
bradjc opened this issue Jul 1, 2020 · 4 comments

@bradjc
Contributor

bradjc commented Jul 1, 2020

  1. Right now the IPC allow() call can succeed, and a notify callback can be triggered, even if the expose_to() call fails. This may be fixed by the 2.0 upgrades.
  2. client_notify() can be used to determine how many other processes are on the board. By iterating through service IDs, the error return values leak which apps are present, even if those apps do not set up any IPC services.
  3. Calling client_notify() on a different process causes the grant region to be allocated in the callee's process, consuming application memory that app didn't intend to spend.
  4. Currently, scheduling a notify callback may fail for a variety of reasons (the callback queue is full, an allocation fails, the calling process has died, etc.), and neither the caller nor the callee is made aware of the failure (see the TODO in kernel/src/sched.rs).
  5. Buffers whose length is not a power of two cause silent failures on ARM (see the sketch after this list).
  6. It seems to be the case that app A faulting does not reset the MPU configuration of app B, even if app A shared a buffer with app B. (I haven't verified this experimentally, just tried to read through the fault code -- Hudson)
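
A minimal userspace sketch (libtock-c style C; ipc_discover, ipc_share, and ipc_notify_service are used with assumed signatures that may not match the current API) of how points (1) and (5) surface: every call reports success even though a 30-byte buffer can never be covered by an ARM MPU region, so the service faults or sees nothing when it is notified.

```c
#include <stdio.h>
#include <ipc.h>    // libtock-c IPC helpers (header name assumed)

int main(void) {
  size_t svc_id;

  // Look up the service by package name (signature assumed).
  if (ipc_discover("org.tockos.examples.rot13", &svc_id) < 0) {
    printf("service not installed\n");
    return -1;
  }

  // 30 bytes: not a power of two and not size-aligned. The ARM MPU can
  // only protect power-of-two sized, size-aligned regions, so the kernel
  // cannot actually expose this buffer to the service -- yet, per points
  // (1) and (5), the share below still reports success.
  static char buf[30] = "hello over ipc";

  if (ipc_share(svc_id, buf, sizeof(buf)) < 0) {
    // In practice this branch is not taken: the call "succeeds".
    printf("share failed\n");
    return -1;
  }

  // The notify is delivered anyway; the service faults (or reads stale
  // memory) as soon as it touches the buffer.
  ipc_notify_service(svc_id);

  return 0;
}
```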
bradjc mentioned this issue on Feb 16, 2021
bradjc changed the title from "IPC: Return error if shared memory region cannot be protected due to MPU alignment issues" to "IPC issues" on Feb 16, 2021
alevy pinned this issue on Feb 17, 2021
@bradjc
Contributor Author

bradjc commented Apr 26, 2021

If we re-implement IPC to use message passing rather than shared memory, we can revisit the MPU implementation (re: #1532).

@alexandruradovici
Contributor

I was playing with the IPC API and noticed that applications fault due to memory protection if the shared buffers are not a multiple of 64 bytes in size. The buffer's memory region does not seem to be mapped.

I cannot figure out the reason for this; all the API calls seem to return success.

If this is a requirement, maybe we should add a check to the kernel and refuse such buffers?
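
A minimal workaround sketch, under the assumption that the multiple-of-64 requirement comes from the Cortex-M MPU's power-of-two size and alignment constraints (the ipc_share signature is an assumption): pad the shared buffer to a full 64 bytes and align it so the kernel can actually map it.

```c
#include <string.h>
#include <ipc.h>    // libtock-c IPC helpers (header name assumed)

// The Cortex-M MPU can only protect power-of-two sized, size-aligned
// regions, so reserve a 64-byte, 64-byte-aligned buffer and always share
// the whole region rather than the message length.
static char ipc_buf[64] __attribute__((aligned(64)));

int share_message(size_t svc_id, const char* msg) {
  size_t len = strlen(msg);
  if (len > sizeof(ipc_buf)) return -1;

  memcpy(ipc_buf, msg, len);
  // Share the full aligned region (signature assumed).
  return ipc_share(svc_id, ipc_buf, sizeof(ipc_buf));
}
```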

@lschuermann
Member

lschuermann commented Dec 13, 2021

While debugging #2906 I've noticed a more intricate issue around app load order and what appears to be a racy relationship between the IPC service registering its callback and the client notifying the service. This is in the context of a failing rot13_{service,client} CI test. While debugging these issues I changed the apps' order in the TBF list and inserted printf statements. With rot13_service and rot13_client installed in this order, what seems to be happening in the normal (good) case is:

  1. rot13_client performs initialization and all of its IPC calls including ipc_notify_service in one go, which will enqueue a task in the target process' queue
  2. rot13_service performs initialization and the call to ipc_register_service_callback, which will register its callback
  3. the kernel processes the task enqueued in (1) which calls back to IPC::schedule_upcall, which will ultimately schedule the proper upcall on the rot13_service process
  4. the IPC service callback scheduled in (3) is called by the kernel scheduler

Now, if the kernel were to execute (3) before (2), the IPC task would be converted into a callback (FunctionCall) before the service had a chance to subscribe. This function call is then dropped because it is a null-callback; and even if it weren't scheduled, as soon as the service subscribes this pending upcall would be removed from its queue. It appears this behavior can be triggered through a combination of load order, app behavior, and scheduler behavior, but it is most obvious when a simple printf statement is inserted at the beginning of the service's main. With the printf statement, if the client is started first, notifying the service silently fails, whereas if the service is started first it will likely succeed, depending on the scheduler's behavior. Here is an example system call trace of this happening (the service prints "Rot13 service!"):

System call trace:

  • Process 0: rot13_service
  • Process 1: rot13_client
Verilated LiteX+VexRiscv: initialization complete, entering main loop.
[1] function_call @0x88088(0x88060, 0x40010000, 0x173c, 0x40010000)
[1] memop(0, 0x40010a40) = Success
[1] memop(10, 0x40010800) = Success
[1] memop(11, 0x40010a40) = Success
[1] read-only allow(0x10000, 0, @0x8ae5c, 0x19) = AllowReadOnlySuccess(0x0, 0)
[1] cmd(0x10000, 1, 0x0, 0x0) = SuccessU32(0)
[1] read-only allow(0x10000, 0, @0x0, 0x0) = AllowReadOnlySuccess(0x8ae5c, 25)
[1] remove_pending_upcalls[0x10000:0] = 0 upcall(s) removed
[1] subscribe(0x10000, 0, @0x88108, 0x40010a00) = SubscribeSuccess(0x0, 0)
[1] read-write allow(0x10000, 0, @0x40010a00, 0x40) = AllowReadWriteSuccess(0x0, 0)
[1] cmd(0x10000, 2, 0x0, 0x0) = Success
[1] yield. which: 1
[0] function_call @0x80088(0x80060, 0x40008000, 0x16dc, 0x40008000)
[0] memop(0, 0x400089cc) = Success
[0] memop(10, 0x40008800) = Success
[0] memop(11, 0x400089cc) = Success
[0] memop(1, 0x0) = SuccessU32(1073777100)
[0] memop(1, 0x1b4) = SuccessU32(1073777100)
[0] memop(1, 0x408) = SuccessU32(1073777536)
[0] memop(1, 0x18) = SuccessU32(1073778568)
[0] memop(1, 0x18) = SuccessU32(1073778592)
[0] read-only allow(0x1, 1, @0x40008fa8, 0xf) = AllowReadOnlySuccess(0x0, 0)
[0] remove_pending_upcalls[0x1:1] = 0 upcall(s) removed
[0] subscribe(0x1, 1, @0x81d14, 0x0) = SubscribeSuccess(0x0, 0)
[0] cmd(0x1, 1, 0xf, 0x0) = Success
Rot13 service!
[0] schedule[0x1:1] @0x81d14(0xf, 0x0, 0x0, 0x0) = Ok(())
[0] yield. which: 1
[0] schedule[0x10000:0] @0x0(0x1, 0x40, 0x40010a00, 0x0) = Ok(())
[0] function_call @0x81d14(0xf, 0x0, 0x0, 0x0)
[0] read-only allow(0x10000, 0, @0x81e08, 0x19) = AllowReadOnlySuccess(0x0, 0)
[0] cmd(0x10000, 1, 0x0, 0x0) = SuccessU32(0)
[0] read-only allow(0x10000, 0, @0x0, 0x0) = AllowReadOnlySuccess(0x81e08, 25)
[0] remove_pending_upcalls[0x10000:0] = 0 upcall(s) removed
[0] subscribe(0x10000, 0, @0x80108, 0x0) = SubscribeSuccess(0x0, 0)
[0] yield. which: 1

We currently have no way to tell whether notifying an IPC service (i.e. scheduling a callback in the service process) actually worked. While the source code used to suggest that we might have a way of identifying a null-callback (EOFF), Process::enqueue_task does not actually provide any information about whether the enqueued IPC task will result in a callback being scheduled. This usually makes sense and is a logical consequence of an architecture built around treating a null-callback exactly the same as a proper "do nothing" callback, but here it is a problem.
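
For reference, a sketch of the service side of the experiment described above (libtock-c style C; the ipc_register_service_callback signature and the upcall argument meanings are assumptions). The printf at the top of main is enough for the client's notification to be processed before the service has subscribed, so the resulting null-upcall is silently dropped:

```c
#include <stdio.h>
#include <tock.h>
#include <ipc.h>   // libtock-c IPC helpers (header name assumed)

// Upcall invoked whenever a client notifies this service; argument
// meanings follow the usual libtock-c conventions (assumed).
static void rot13_upcall(int pid, int len, int buf, void* ud) {
  // ... rot13 the shared buffer, then notify the client back ...
}

int main(void) {
  // This single print delays the subscribe below long enough that, with
  // the client started first, the kernel converts the client's pending
  // IPC task into an upcall while the registered callback is still null,
  // so the upcall is dropped without any error reaching either process.
  printf("Rot13 service!\n");

  // Register this process as an IPC service (signature assumed).
  ipc_register_service_callback("org.tockos.examples.rot13",
                                rot13_upcall, NULL);

  while (1) {
    yield();
  }
}
```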

@LawrenceEsswood

To add something to point (6) of the current IPC issues: restarting an application resets app_break/kernel_memory_break/allow_high_water_mark. The result is that an application (A) can:

  1. Make a very large malloc(), causing a brk system call that claims all remaining grant/heap memory for itself.
  2. Allow this memory to another process (call it B).
  3. Restart itself in place (e.g., by faulting), which resets the watermarks.
  4. Make syscalls to capsules, causing grant allocations that exceed the size of its allocations in its previous life.

This will result in application B having direct access to kernel data structures. Spicy. This can be leveraged even from safe userspace Rust.
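
A rough sketch of application A's side of that sequence (libtock-c style C; the ipc_discover/ipc_share signatures, the peer app name, the malloc size, and the deliberate null-pointer fault are all illustrative assumptions):

```c
#include <stdint.h>
#include <stdlib.h>
#include <ipc.h>   // libtock-c IPC helpers (header name assumed)

int main(void) {
  size_t b_id;
  ipc_discover("org.example.app_b", &b_id);   // hypothetical peer app B

  // (1) Grab (essentially) all remaining process memory; malloc falls
  //     back to the brk/memop syscall, pushing app_break up toward the
  //     grant region. The size here is illustrative.
  size_t huge = 32 * 1024;
  uint8_t* buf = malloc(huge);
  if (buf == NULL) return -1;

  // (2) Allow the whole region to process B over IPC (signature assumed).
  ipc_share(b_id, buf, huge);

  // (3) Fault on purpose so the kernel restarts this app in place. The
  //     restart resets app_break / kernel_memory_break /
  //     allow_high_water_mark, but B's MPU access to `buf` is not revoked.
  *(volatile uint32_t*)0x0 = 0xdeadbeef;

  // (4) After the restart, main() runs again from the top; any capsule
  //     syscall that triggers a grant allocation now places kernel data
  //     structures inside memory that B can still read and write.
  //     (Distinguishing the first and second life is elided here.)
  return 0;
}
```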
