Skip to content
This repository has been archived by the owner on Mar 7, 2021. It is now read-only.

Add RCU read bindings (#143) and each_process() iterator #250

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

geofft
Copy link
Collaborator

@geofft geofft commented Aug 17, 2020

Add an RcuReadGuard object for calling functions that expect to be
RCU-protected, and add a Rust equivalent of for_each_process and very
light task_struct bindings sufficient to write a simple test/demo.

@geofft geofft requested a review from alex August 17, 2020 04:51
@geofft
Copy link
Collaborator Author

geofft commented Aug 17, 2020

Obviously the test case needs work but I'd like some feedback on this design. I'm liking it, but at this point I've made myself familiar with how RCU works :)

@geofft geofft mentioned this pull request Aug 17, 2020
) -> linux_kernel_module::KernelResult<()> {
let g = rcu::RcuReadGuard::new();
for mut p in sched::each_process(&g) {
buf.write(&p.comm())?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writing to a user pointer can sleep, right, so this is incorrect, since you may sleep during an RCU read-side critical section?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think you're right. How would you do this in C? Make a local variable, fill it in, and then copy_to_user after rcu_read_unlock?

I can do that, I'm not sure how to get Rust to catch it, and I'm curious if Sparse would have caught it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this is tricky, since in the kernel we usually want to avoid dynamically-sized allocation if we can (especially with lengths under user control), so copying the entire output into memory is frowned upon, but also, once we rcu_read_unlock, we can't guarantee our struct task_struct * linked list doesn't change out from under us.

In practice, something like this would probably be built on the seq_file abstraction. seq_file users have a "position" abstraction that they must use to track progress through their underlying data source, and in practice, the seq_file infrastructure implements a single-page bounce buffer that callers write into (one page at a time), and then seq_file copies that back to userspace.

But wait, you might ask, how do you use that position abstraction to keep your position in the task_struct linked list across each page of data? Do you have to get_task_struct the current finger into the list?

Well, no, and in fact that would be illegal, since it would allow userspace to construct a long-lived reference to a task by opening your file, advancing through it with small reads until it got to the point you wanted, and then just sitting on the fd. By sending that fd to another process you could construct a circular list of task references, which is no go.

So instead, you would just keep your cursor as an integer "index into the task list", and, each time you were called upon to return another page of data, you would start with init_task and advance N tasks forward and resume printing.

This is, you may note, both (a) quadratic, and (b) racy.

And yes, this is how almost every file in /proc works; If you read (e.g.) /proc/mounts and it takes more than a single read() call to return the full contents, you can miss mount entries, even ones that existed continually for the entire open/read/read/close cycle, if there are concurrent mounts and unmounts.

This was a fun day at Ksplice

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched to printk, which one hopes is safe from RCU context. (Our println! is heap-free.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, printk is safe in any context, even really hostile ones like NMIs or stop_machine

Copy link
Member

@alex alex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a good reference on RCUs to get up to speed?

@nelhage's comment leaves me a bit concerned that it's incredibly easy to accidentally violate the rules leading to deadlocks. Is that just life in C as well?

src/sched.rs Outdated
/// typically the base name of the command, and does not have the
/// full path or arguments. It's a fixed-sized set of bytes, but by
/// convention it's interpreted as NUL-terminated.
pub fn comm(&mut self) -> Vec<u8> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No practical way to return an &[u8]?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.0.comm is only valid while we've got a lock on self.0, and I don't want to hold a lock for the lifetime of this structure.

We could probably expose the locking to Rust, i.e, something like

impl TaskStruct {
    fn lock(&mut self) -> LockedTaskStruct;
}
impl LockedTaskStruct {
    fn comm(&mut self) -> &[u8];
}

or perhaps even something like

impl TaskStruct {
    fn comm(&mut self, g: &mut LockGuard) -> &[u8];
}

though given the lack of NLL, I'm more hesitant about using an RAII guard to lock a task_struct than I am about making an RCU read-side critical section. An RcuReadGuard only prevents cleanup after making changes, it doesn't prevent a writer from doing the reading, copying, or updating.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason not to return a [u8; TASK_COMM_LEN]? This is basically what get_task_comm does in C.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, that would work. I was actually leaning towards taking an &mut [u8] but I guess yeah you can just return the array and make the compiler deal with it.

(This code was largely just "make anything work at all")

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched it to returning a [u8; TASK_COMM_LEN].

It seems that many of my task structs have some garbage after the null byte. For instance, if in Python you open("/proc/self/comm").write("foo"), you'll find that your comm is now foo\0on\0\0\0... because it was previously python. I could return this directly, but because we support memory safety here on Team Rust, and because I had to do shenanigans to convert i8 -> u8 anyway, I opted to cut it off after the first NUL byte.

src/sched.rs Outdated
unsafe {
task_lock_helper(self.0);
// if only char were unsigned char
let v = (&*(&(*self.0).comm[..] as *const _ as *const [u8])).to_vec();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

waaaaa

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It gets you an [i8; 16] (i.e. char[16]) and this seems to be the preferred non-transmute way to turn that into a [u8]. We should probably make a wrapper for this, this is sort of the job of OsStr or something.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is less shenanigous now. (It's more verbose but I hope the optimizer knows what it's doing.)

src/sched.rs Show resolved Hide resolved
src/sched.rs Outdated Show resolved Hide resolved
tests/for-each-process/src/lib.rs Outdated Show resolved Hide resolved
@geofft
Copy link
Collaborator Author

geofft commented Aug 17, 2020

Is there a good reference on RCUs to get up to speed?

There's an overview and some links in #143.

I'm about 80% sure that "crossbeam, but you guarantee epochs end because the kernel has precise control over scheduling" is a good tl;dr.

The basic idea is it's an extremely reader-optimized reader-writer lock pattern. A "writer" can't modify an RCU-protected structure directly. Instead, it makes a copy of the thing it wants to modify, updates that, atomically updates a pointer, waits for all readers of the old thing to be done, and then frees the old thing. If you know that readers exit quickly, then you can do this with almost no overhead on the reader side and without having an actual lock in memory to point to.

So it's mostly useful for protecting linked lists and other structures, like the list of processes in this example. It does not protect data that isn't behind a pointer, which means A) it's not a tool for multiple writers to coordinate (you want a real lock for that) and B) as in this example, actually accessing data inside a specific item in the linked list in a race-free way usually requires taking a real lock.

@nelhage
Copy link
Collaborator

nelhage commented Aug 17, 2020

@nelhage's comment leaves me a bit concerned that it's incredibly easy to accidentally violate the rules leading to deadlocks. Is that just life in C as well?

This is correct. The kernel rules about (a) when it's safe to schedule(), and (b) what operations will or won't schedule() are subtle and very easy to violate. And, of course, not uniformly documented, just a matter of oral tradition and lore and smatterings of references in different documents.

@geofft
Copy link
Collaborator Author

geofft commented Aug 17, 2020

I think this is an equivalent problem to only calling async-signal-safe functions from a signal handler, or only calling kmalloc(GFP_NOWAIT) instead of kmalloc(GFP_KERNEL) in an interrupt. It seems like it's theoretically possible for Rust to enforce this by putting annotations on functions, I just don't know how.

@geofft
Copy link
Collaborator Author

geofft commented Aug 17, 2020

https://github.com/japaric/cargo-call-stack might be some inspiration for "you can't call function Y from function X". It seems to work by parsing LLVM IR (it has its own parser) and constructing a call graph.

@geofft geofft changed the title WIP: RCU Add RCU read bindings (#143) and each_process() iterator Aug 19, 2020
@geofft geofft marked this pull request as ready for review August 19, 2020 00:18
@geofft
Copy link
Collaborator Author

geofft commented Aug 21, 2020

I think this is ready to land - any concerns about the current API?

@alex
Copy link
Member

alex commented Aug 21, 2020 via email

src/sched.rs Outdated
Some(lvalue) => {
*lvalue = *byte as _;
}
None => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this even reachable? Can you zip .comm and result together?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Hopefully not?
  2. Probably? I mean really I want to just call copy_from_slice but there's no way to do that safely while casting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, this code gets much simpler if we're using ArrayVec :-)

src/sched.rs Outdated
_g: &'g rcu::RcuReadGuard,
}

pub fn each_process(g: &rcu::RcuReadGuard) -> EachProcess {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this return impl Iterator<Item=TaskStruct>>? no reason to have this type in the public API, right?

src/sched.rs Show resolved Hide resolved
let g = rcu::RcuReadGuard::new();
for mut p in sched::each_process(&g) {
let comm = p.comm();
let comm_until_nul = comm.split(|c| *c == 0).next().unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will there be a null if the comm name uses up the full buffer?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the buffer as returned by the kernel always includes at least one trailing NUL, but more importantly, split() always returns at least one element so we're fine either way.

tests/for-each-process/tests/tests.rs Show resolved Hide resolved
/// typically the base name of the command, and does not have the
/// full path or arguments. It's a fixed-sized set of bytes, but by
/// convention it's interpreted as NUL-terminated.
pub fn comm(&mut self) -> [u8; bindings::TASK_COMM_LEN as usize] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On thinking more... do we need a type that's backed by a fixed size array, but maintains a "how much of this is valid" usize and derefs to slice/mut slice?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. smallvec is basically this except that it falls back to Vec; I think we want one that just doesn't fall back.

While we're at it I also want to bury the i8/u8 nonsense under an abstraction too (i.e. impl Deref<Target=[u8]> for something constructed from an &[i8]). Really what I think I want is OsStr from libstd :(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an RcuReadGuard object for calling functions that expect to be
RCU-protected, and add a Rust equivalent of for_each_process and a very
light task_struct bindings sufficient to write a simple test/demo.
// - next_task calls rcu_dereference internally, which is safe
// because we hold self._g.
// - The returned reference has lifetime 'g, which is valid
// because self._g lives at least that long.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that holding RCU doesn't guarantee the task list to be unchanged, RCU read-side critical section here only guarantees that the task_struct we reference here won't get freed (in side the RCU read-side critical section). We use another lock "tasklist_lock" to protect the tasklist.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants