WIP: FS/GS-based TLS access abstraction #257

Draft · wants to merge 1 commit into master

Conversation

josephlr
Contributor

I was trying to use FS/GS-based Thread Local Storage (TLS) and noticed we don't have an easy way to view the memory pointed to by the FS or GS base as a structure (a common pattern in OS dev).

This patch would allow for the following code:

/// Kernel per-CPU layout
struct PerCpu {
    stack: VirtAddr,
    id: u32,
}

// Declare "gs looks like PerCpu"
static PER_CPU: Wrapper<GS, PerCpu> = Wrapper::new();

// Kernel needs to call this for each thread
PER_CPU.init(Some(alloc_tls()));

// Fields can now be read/written
unsafe fn set_stack(rsp: VirtAddr) {
    tls_write!(PER_CPU, stack, rsp);
}
fn get_id() -> u32 {
    unsafe {
        tls_read!(PER_CPU, id)
    }
}
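
For reference, a field access like tls_read!(PER_CPU, id) should lower to a single gs-relative mov. A minimal sketch of what the macro could expand to for the id field above (illustrative only, not the final expansion):

use core::arch::asm;

/// Read `PerCpu::id` relative to the GS base (sketch; the real macro
/// would derive the offset from the field's position in the struct).
unsafe fn gs_read_id() -> u32 {
    let value: u32;
    asm!(
        "mov {value:e}, gs:[8]", // 8 = byte offset of `id` in `PerCpu`
        value = out(reg) value,
        options(nostack, preserves_flags, readonly),
    );
    value
}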

I'm not sure exactly what we want from the API, but I wanted to get feedback on the design sooner rather than later. Issues to discuss:

  • Which ops should be safe/unsafe?
  • Should reading/writing any field work? Or should we restrict to just 1/2/4/8-byte wide fields?
  • Should the PerCpu struct have to implement some sort of trait?
  • Should we have a way to just read/write the entire structure (say tls_read!(PER_CPU))?
  • Naming (Tls vs Wrapper, tls_read! vs per_cpu_read!)
  • Which submodule should this live in? (instructions::tls? instructions::segmentation?)

This implementation seems to compile to fairly efficient code. See this Rust implementation vs a C implementation (using __seg_gs). The C implementation is still slightly better, which isn't surprising given that __seg_gs has LLVM support (on the other hand, I was able to get __seg_gs to crash clang, so it's quite unstable).

TODO: Docs, tests, finish design

Signed-off-by: Joe Richey <joerichey@google.com>

@josephlr josephlr requested a review from phil-opp May 26, 2021 10:27
@phil-opp
Member

Thanks a lot for proposing this! The design proposed here looks good to me overall. The assembly comparison is really cool, it's great to see that the use of inline assembly still leads to optimized code.

As I understand it, there is also native LLVM support for thread local data, which Rust exposes through the unstable #[thread_local] attribute. The problem with this seems to be that only fs is supported, which is typically used for user-space TLS (kernel-level TLS normally uses gs). Is this correct?

One thing that I'm curious about is how the performance of your proposal compares to storing just a pointer at gs offset 0, which then points to the actual TLS struct. This would require a second memory lookup, but might allow the compiler to optimize better, since the second access is just a normal memory access. I tried this approach here: https://godbolt.org/z/W4a4YxMPK . As expected, there are now two memory accesses every time, which is probably worse for smaller types (it might be worth gathering some actual performance data on this). However, for larger types, it leads to much better code. For example, consider the one_page function, which accesses a single u64 of the page field:

example::one_page:
        xor     eax, eax
        mov     rax, qword ptr gs:[rax]
        mov     rax, qword ptr [rax + 2784]
        ret

Without a pointer, we have to load the full page array first before accessing the field, which the compiler cannot optimize away: https://godbolt.org/z/hMrWdsE1Y
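
In Rust, this pointer-based variant might look roughly like the following (a sketch; the struct layout and helper name are made up for illustration):

use core::arch::asm;

#[repr(C)]
struct PerThread {
    page: [u64; 512],
    id: u32,
}

/// Load the per-thread pointer stored at gs:[0] (sketch).
unsafe fn per_thread() -> *mut PerThread {
    let ptr: *mut PerThread;
    asm!(
        "mov {ptr}, gs:[0]",
        ptr = out(reg) ptr,
        options(nostack, preserves_flags, readonly),
    );
    ptr
}

/// After the single gs-relative load, this is an ordinary memory access
/// that LLVM can optimize like any other pointer dereference.
unsafe fn one_page() -> u64 {
    (*per_thread()).page[348]
}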

  • Should reading/writing any field work? Or should we restrict to just 1/2/4/8-byte wide fields?

I think for values larger than 8 bytes it makes more sense to go through a pointer, for the performance reasons mentioned above. Or maybe it is possible to adjust the macro to also support loading a specific array element or nested struct field directly?

We should also think about values that contain padding bytes. For example, does the compiler require them to have a specific value (e.g. 0)?

  • Should we have a way to just read/write the entire structure (say tls_read!(PER_CPU))?

Given that this is probably a very costly operation (the optimizer won't remove any of the inline_asm), I don't think that it's a good idea to support this through the same macro. If needed, we could add a separate macro for this that documents the potential performance problems (and also suggests storing a pointer as an alternative).

  • Which ops should be safe/unsafe?

The Wrapper::init method definitely needs to be unsafe because the given address must be valid (mapped, not aliased, etc.). Also, the TLS structure type must be unique within a thread (mixing wrapper types can lead to silent transmutes). The general read/write methods need to be unsafe too because they allow arbitrary pointers/offsets as argument.

We might be able to make the tls_read!/tls_write! macros safe, but this would require some additional checks. For one, we need to assert that the FS/GS registers are initialized before access. A simple is_null check would only work if Wrapper::new initializes FS/GS with 0. Alternatively, we could add an AtomicBool flag to the Wrapper type itself.
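
Such a guard might look like this (a sketch, assuming an uninitialized base register reads as 0 and using the crate's Segment64 trait):

use x86_64::registers::segmentation::Segment64;

/// Sketch of a runtime check for the safe macro variants: panic unless
/// init() has already run on this thread.
fn assert_tls_initialized<S: Segment64>() {
    assert!(
        S::read_base().as_u64() != 0,
        "TLS base register not initialized on this thread",
    );
}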

  • Naming (Tls vs Wrapper, tls_read! vs per_cpu_read!)

I think I prefer the term TLS over "per cpu" since the fs/gs value is typically set per thread (and updated accordingly by the context switching code).

  • Which submodule should this live in? (instructions::tls? instructions::segmentation?)

I'm fine with creating a separate tls submodule. We can link it from the wrgsbase etc. methods.

@Andy-Python-Programmer

Andy-Python-Programmer commented Jun 2, 2021

Amazing!! It would be nice to be able to do this without needing fancy macros, because those could lead to lots of inlined code, right? Personally, I think something like this would be efficient:

static LOCAL: CpuLocal<Something> = CpuLocal::new(CpuLocalBase::FSBase, || Something::new_uninit());

fn test() {
    LOCAL.get_mut().something = 10;
    //   ^^^^^^^^^^^
    //       Returns &mut Something
     *LOCAL.get_mut() = NEW_THING;
}

impl<T> CpuLocal<T> {
    pub fn new(base: CpuLocalBase, init: fn() -> T) -> CpuLocal<T>;
    pub fn get_mut(&self) -> &mut T;
    pub fn get(&self) -> &T;
    pub fn get_offset(&self) -> VirtAddr;
}

@josephlr
Contributor Author

josephlr commented Jun 2, 2021

Amazing!! It would be nice to be able to do this without needing fancy macros, because those could lead to lots of inlined code, right?

I agree; the macros should only be needed if you want maximal performance. The basic idea would be to have the TLS structure contain a pointer to itself (as @phil-opp mentioned above, stored at some fixed offset). This is what Linux does and what Fuchsia does on x86_64 (other architectures don't need such a pointer).

The basic idea would then be to have the following API (say in a tls module):

// Assumes the crate's Segment64 trait (read_base/write_base):
use core::marker::PhantomData;
use x86_64::{registers::segmentation::Segment64, VirtAddr};

#[repr(C)]
pub struct Value<T> {
    ptr: *const T,
    val: T,
}

impl<T> Value<T> {
    const fn new(val: T) -> Self;
}

pub struct Wrapper<S, T>(PhantomData<(S, *mut Value<T>)>);

impl<S: Segment64, T> Wrapper<S, T> {
    pub const fn new() -> Self {
        Self(PhantomData)
    }

    /// Needs to be called on each thread to initialize the segment base
    pub unsafe fn init(&self, new: *mut Value<T>) -> *mut Value<T> {
        let old = S::read_base().as_mut_ptr();
        S::write_base(VirtAddr::from_ptr(new));
        old
    }

    /// Only safe to call if `init()` has been called on this thread
    pub unsafe fn with<R>(&'static self, f: impl FnOnce(&mut T) -> R) -> R {
        // Read the self-pointer stored at offset 0 of the TLS block.
        let base = S::read_u64_at_offset_zero() as *mut Value<T>;
        let r: &mut T = &mut (*base).val;
        f(r)
    }
}

@Andy-Python-Programmer would that sort of API work? I think the with method is needed for similar reasons to why LocalKey::with is needed. You still don't want multiple mutable aliases to the same value.

EDIT: I don't think there's a way to have some sort of lazy initialization function in the constructor. We wouldn't know when the value had been initialized, so we wouldn't know when to call the constructor. We can't use an AtomicBool flag in the Wrapper type as that would end up being a global flag rather than a per-thread flag.
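
For illustration, usage might look like this (hypothetical; alloc_value stands in for whatever per-thread allocation scheme the kernel uses):

static LOCAL: Wrapper<GS, u32> = Wrapper::new();

// On each thread, before first access:
unsafe fn thread_setup() {
    let value: *mut Value<u32> = alloc_value(Value::new(0));
    LOCAL.init(value);
}

// Afterwards, all access goes through `with`:
fn bump_counter() -> u32 {
    unsafe {
        LOCAL.with(|v| {
            *v += 1;
            *v
        })
    }
}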

@Andy-Python-Programmer

Andy-Python-Programmer commented Jun 2, 2021

Ah, I see. The current read macro takes a field (its matcher is ($wrapper:path, $field:tt)). What if you want to compare two CPU-local Arcs? For example:

if Arc::ptr_eq(&new_process, &previous_process) {
    return false;
}

This is quite common, as we want the current process to be local to the CPU.

@josephlr
Contributor Author

josephlr commented Jun 2, 2021

What if you want to compare two CPU local Arcs?

I guess you would need to copy them into the appropriate context. We could also add an as_ptr() method to Wrapper that returns a *const T. It would then be up to the user to make sure that dereferencing the pointer is safe (and that init() has been called before as_ptr).
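
Such a method could look roughly like this (a sketch; the exact signature would need discussion):

impl<S: Segment64, T> Wrapper<S, T> {
    /// Sketch of the suggested escape hatch.
    ///
    /// # Safety
    /// `init()` must have been called on this thread, and the caller must
    /// uphold the usual aliasing rules when dereferencing the pointer.
    pub unsafe fn as_ptr(&self) -> *const T {
        let base = S::read_u64_at_offset_zero() as *const Value<T>;
        core::ptr::addr_of!((*base).val)
    }
}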

@Andy-Python-Programmer

All good!

@josephlr
Contributor Author

josephlr commented Jun 2, 2021

I think I'm going to initially submit a PR that adds the non-macro changes, and then do the macro stuff in a follow-up PR. The generated code is fairly good either way, so there's only a marginal benefit to the macro approach.

For example, consider this setup for GS-based thread-local storage:

#[repr(C)]
struct ThreadLocal {
    a: u64,
    b: u32,
}

static LOCAL: Wrapper<GS, ThreadLocal> = Wrapper::new();

Using the with method:

pub fn slow_get_a() -> u64 {
    unsafe { LOCAL.with(|t| t.a) }
}
pub fn slow_set_b(v: u32) {
    unsafe { LOCAL.with(|t| t.b = v) }
}

gives code that uses two movs:

example::slow_get_a:
        mov     rax, qword ptr gs:[0]
        mov     rax, qword ptr [rax]
        ret
example::slow_set_b:
        mov     rax, qword ptr gs:[0]
        mov     dword ptr [rax + 8], edi
        ret

Using the macros:

pub fn fast_get_a() -> u64 {
    unsafe { tls_read!(LOCAL.a) }
}

pub fn fast_set_b(v: u32) {
    unsafe { tls_write!(LOCAL.b, v) }
}

gives code equal to the C equivalent:

example::fast_get_a:
        mov     rax, qword ptr gs:[8]
        ret
example::fast_set_b:
        mov     dword ptr gs:[16], edi
        ret

Godbolt Link

@Andy-Python-Programmer

The macro actually produces more optimised assembly, I see.

@phil-opp
Member

phil-opp commented Jun 5, 2021

I think I'm going to initially submit a PR that adds the non-macro changes, and then do the macro stuff in a follow-up PR. The generated code is fairly good either way, so there's only a marginal benefit to the macro approach.

Sounds good to me!

@josephlr josephlr added the waiting-on-author Waiting for the author to act on review feedback. label Jul 16, 2021
SlyMarbo added a commit to ProjectSerenity/firefly that referenced this pull request Dec 3, 2021
This is a fairly naive implementation, where we allocate a per-CPU
region from CPU_SPACE for each CPU (which is currently just the
first CPU, as we don't have multi-CPU support yet), then we store
a pointer to the beginning of that space in the GS base. To load
the per-CPU data, we simply read the address back out of the GS
base and dereference it to access the data.

This isn't as efficient as the approach being discussed in
rust-osdev/x86_64#257, but it works well enough for now.