Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fully support OwnedFd #2417

Open
anti-entropy123 opened this issue Oct 9, 2023 · 55 comments
Open

Fully support OwnedFd #2417

anti-entropy123 opened this issue Oct 9, 2023 · 55 comments
Assignees
Labels
good first issue Good for newcomers help wanted Extra attention is needed

Comments

@anti-entropy123
Copy link
Contributor

ref: issue #2317, PR #2369

Introducation
Since nix version 0.27.1, some APIs (e.g., nix::pty::openpty(), nix::sys::socket::socket()) have started using OwnedFd to manage FDs. However, it is not straightforward to pass OwnedFd between forks. So in PR #2369, we still use RawFd in the subsequent steps.

This issue try to do the change from RawFd to OwnedFd.

@anti-entropy123
Copy link
Contributor Author

anti-entropy123 commented Oct 9, 2023

Some ideas
The main challenge with using OwnedFd is that ContainerArgs requires its fields to satisfy the Clone trait, but OwnedFd does not (and shouldn't, in my opinion) support it.
Maybe we can impl Clone for structs like Sender, in which we can use OwnedFd::try_clone() to safely share underlying sockets or connections. If we do this, we also need to check whether old FD will be leaked due to CloneCb.

@utam0k utam0k added help wanted Extra attention is needed good first issue Good for newcomers labels Oct 10, 2023
@oirom
Copy link

oirom commented Dec 21, 2023

Hi, can i try this issue??
I'm very newbie for rust and container runtime, so it will be take a bit long time I think...

@utam0k
Copy link
Member

utam0k commented Dec 21, 2023

Don't worry ;)

I'm very newbie for rust and container runtime, so it will be take a bit long time I think...

@zahash
Copy link
Contributor

zahash commented Mar 25, 2024

@anti-entropy123 what should happen if OwnedFd::try_clone() returns Err?

say, when manually implementing the Clone trait for ContainerType

impl Clone for ContainerType {
    fn clone(&self) -> Self {
        match self {
            Self::InitContainer => Self::InitContainer,
            Self::TenantContainer { exec_notify_fd } => Self::TenantContainer { exec_notify_fd: exec_notify_fd.try_clone().unwrap() },
        }
    }
}

is unwrapping fine?

@zahash
Copy link
Contributor

zahash commented Mar 25, 2024

i have a question. lets say i try_clone() a OwnedFd a bunch of times.

let a = f.try_clone().unwrap();
let b = f.try_clone().unwrap();

what if one of those variables is dropped? will the file close?
if thats the case, then thats a big problem because having a OwnedFd instance assumes that the file is open

@zahash
Copy link
Contributor

zahash commented Mar 25, 2024

does it make sense to do this maybe?

pub enum ContainerType {
    InitContainer,
    TenantContainer { exec_notify_fd: Rc<OwnedFd> },
}

@anti-entropy123
Copy link
Contributor Author

what should happen if OwnedFd::try_clone() returns Err?

say, when manually implementing the Clone trait for ContainerType

Directly unwrapping indeed doesn't seem quite appropriate...

Could we just implement a try_clone method for structures like ContainerType (should also include ContainerArgs), rather than derive(Clone)? This way, errors can be passed, reusing the error handling logic from before. 👀

This suggestion might not be fully matured, but I hope it helps you. @zahash

@zahash
Copy link
Contributor

zahash commented Mar 25, 2024

i have a question. lets say i try_clone() a OwnedFd a bunch of times.

let a = f.try_clone().unwrap();
let b = f.try_clone().unwrap();

what if one of those variables is dropped? will the file close? if thats the case, then thats a big problem because having a OwnedFd instance assumes that the file is open

@anti-entropy123 what about this issue then?

@anti-entropy123
Copy link
Contributor Author

what about this issue then?

I think closing one of the file descriptors won't affect the usability of the other one.

@zahash
Copy link
Contributor

zahash commented Mar 25, 2024

@anti-entropy123

but the OwnedFd calls close on the wrapped RawFd when dropped.

#[stable(feature = "io_safety", since = "1.63.0")]
impl Drop for OwnedFd {
    #[inline]
    fn drop(&mut self) {
        unsafe {
            // Note that errors are ignored when closing a file descriptor. The
            // reason for this is that if an error occurs we don't actually know if
            // the file descriptor was closed or not, and if we retried (for
            // something like EINTR), we might close another valid file descriptor
            // opened after we closed ours.
            #[cfg(not(target_os = "hermit"))]
            let _ = libc::close(self.fd);
            #[cfg(target_os = "hermit")]
            let _ = hermit_abi::close(self.fd);
        }
    }
}

@anti-entropy123
Copy link
Contributor Author

@zahash Sorry, I don't have a suitable Linux machine with me right now. I can try verifying the issue tomorrow morning.

TenantContainer { exec_notify_fd: Rc },

I've discussed this idea before; you can refer to this link for more information: #2369 (comment)

@zahash
Copy link
Contributor

zahash commented Mar 25, 2024

@anti-entropy123 @utam0k does cargo test run all the available tests?

@zahash
Copy link
Contributor

zahash commented Mar 25, 2024

also, mmap no longer accepts f: Option<F> and instead just accepts f: F
and with in mmap, it tries to unwrap_or(-1) to get the raw fd.

but now, i can't find a way to create a invalid Fd OwnedFd::from_raw_fd(-1) panics at runtime because rust doesn't allow -1 (asserts that fd != -1)

        mman::mmap(
            None,
            NonZeroUsize::new(default_stack_size).ok_or(CloneError::ZeroStackSize)?,
            mman::ProtFlags::PROT_READ | mman::ProtFlags::PROT_WRITE,
            mman::MapFlags::MAP_PRIVATE | mman::MapFlags::MAP_ANONYMOUS | mman::MapFlags::MAP_STACK,
            OwnedFd::from_raw_fd(-1), // used to be None
            0,
        )
        .map_err(CloneError::StackAllocation)?

so i decided to be a little naughty and use OwnedFd::from_raw_fd(-2) and there are no runtime panics due to assertions and all the tests are passing.

is this fine?

OwnedFd::from_raw_fd(RawFd::MAX) also works

@anti-entropy123
Copy link
Contributor Author

@zahash I think we can try using mmap_anonymous https://docs.rs/nix/latest/nix/sys/mman/fn.mmap_anonymous.html

@anti-entropy123
Copy link
Contributor Author

what if one of those variables is dropped? will the file close? if thats the case, then thats a big problem because having a OwnedFd instance assumes that the file is open

I tested this code snippet, and it ran fine.

fn verify_try_clone(fd: OwnedFd) {
    let new_fd = fd.try_clone().unwrap();
    drop(fd);

    let mut new_file = unsafe { File::from_raw_fd(new_fd.as_raw_fd()) };
    new_file.write_all("hello world".as_bytes()).unwrap();
}

@anti-entropy123
Copy link
Contributor Author

but the OwnedFd calls close on the wrapped RawFd when dropped.

@zahash The try_clone method of OwnedFd ultimately calls a syscall similar to dup2, so in the end, there are indeed two separate file descriptors. https://man7.org/linux/man-pages/man2/dup.2.html

@zahash
Copy link
Contributor

zahash commented Mar 26, 2024

@anti-entropy123 @utam0k

i was also thinking about having a ref datatype

pub enum ContainerTypeRef<'fd> {
    InitContainer,
    TenantContainer { exec_notify_id: BorrowedFd<'fd> }
}

that way, we may not have to necessarily clone all the time.

but the CloneCb in container_intermediate_process requires cloning the args.

    let cb: CloneCb = {
        let args = args.clone();
        ...
    };

@zahash
Copy link
Contributor

zahash commented Mar 26, 2024

@anti-entropy123 @utam0k

i have made changes related to issues #2417 and #2723

all the tests pass. Please checkout the patch file. if you like where this is going and satisfied with most of the changes, i will create a pull request for further review.

diff --git a/crates/libcgroups/src/v1/memory.rs b/crates/libcgroups/src/v1/memory.rs
index 98fa6b1e..38b0dd07 100644
--- a/crates/libcgroups/src/v1/memory.rs
+++ b/crates/libcgroups/src/v1/memory.rs
@@ -332,7 +332,7 @@ impl Memory {
             Err(e) => {
                 // we need to look into the raw OS error for an EBUSY status
                 match e.inner().raw_os_error() {
-                    Some(code) => match Errno::from_i32(code) {
+                    Some(code) => match Errno::from_raw(code) {
                         Errno::EBUSY => {
                             let usage = Self::get_memory_usage(cgroup_root)?;
                             let max_usage = Self::get_memory_max_usage(cgroup_root)?;
diff --git a/crates/libcontainer/src/container/builder_impl.rs b/crates/libcontainer/src/container/builder_impl.rs
index 38f2f1cc..59d7a062 100644
--- a/crates/libcontainer/src/container/builder_impl.rs
+++ b/crates/libcontainer/src/container/builder_impl.rs
@@ -132,7 +132,7 @@ impl ContainerBuilderImpl {
             prctl::set_dumpable(false).map_err(|e| {
                 LibcontainerError::Other(format!(
                     "error in setting dumpable to false : {}",
-                    nix::errno::from_i32(e)
+                    nix::errno::Errno::from_raw(e)
                 ))
             })?;
         }
@@ -141,7 +141,7 @@ impl ContainerBuilderImpl {
         // therefore we will have to move all the variable by value. Since self
         // is a shared reference, we have to clone these variables here.
         let container_args = ContainerArgs {
-            container_type: self.container_type,
+            container_type: self.container_type.clone(),
             syscall: self.syscall,
             spec: Rc::clone(&self.spec),
             rootfs: self.rootfs.to_owned(),
diff --git a/crates/libcontainer/src/container/tenant_builder.rs b/crates/libcontainer/src/container/tenant_builder.rs
index 0d805a0d..2c2dbc88 100644
--- a/crates/libcontainer/src/container/tenant_builder.rs
+++ b/crates/libcontainer/src/container/tenant_builder.rs
@@ -1,6 +1,6 @@
 use caps::Capability;
 use nix::fcntl::OFlag;
-use nix::unistd::{self, close, pipe2, read, Pid};
+use nix::unistd::{self, pipe2, read, Pid};
 use oci_spec::runtime::{
     Capabilities as SpecCapabilities, Capability as SpecCapability, LinuxBuilder,
     LinuxCapabilities, LinuxCapabilitiesBuilder, LinuxNamespace, LinuxNamespaceBuilder,
@@ -8,6 +8,7 @@ use oci_spec::runtime::{
 };
 use procfs::process::Namespace;
 
+use std::os::fd::AsRawFd;
 use std::rc::Rc;
 use std::{
     collections::HashMap,
@@ -148,13 +149,13 @@ impl TenantContainerBuilder {
         let mut notify_socket = NotifySocket::new(notify_path);
         notify_socket.notify_container_start()?;
 
-        close(write_end).map_err(LibcontainerError::OtherSyscall)?;
+        drop(builder_impl); // this will close write_end fd
 
         let mut err_str_buf = Vec::new();
 
         loop {
             let mut buf = [0; 3];
-            match read(read_end, &mut buf).map_err(LibcontainerError::OtherSyscall)? {
+            match read(read_end.as_raw_fd(), &mut buf).map_err(LibcontainerError::OtherSyscall)? {
                 0 => {
                     if err_str_buf.is_empty() {
                         return Ok(pid);
diff --git a/crates/libcontainer/src/process/args.rs b/crates/libcontainer/src/process/args.rs
index ad37c07b..11c5d7f2 100644
--- a/crates/libcontainer/src/process/args.rs
+++ b/crates/libcontainer/src/process/args.rs
@@ -1,5 +1,6 @@
 use libcgroups::common::CgroupConfig;
 use oci_spec::runtime::Spec;
+use std::os::fd::OwnedFd;
 use std::os::unix::prelude::RawFd;
 use std::path::PathBuf;
 use std::rc::Rc;
@@ -9,10 +10,21 @@ use crate::notify_socket::NotifyListener;
 use crate::syscall::syscall::SyscallType;
 use crate::user_ns::UserNamespaceConfig;
 use crate::workload::Executor;
-#[derive(Debug, Copy, Clone)]
+#[derive(Debug)]
 pub enum ContainerType {
     InitContainer,
-    TenantContainer { exec_notify_fd: RawFd },
+    TenantContainer { exec_notify_fd: OwnedFd },
+}
+
+impl Clone for ContainerType {
+    fn clone(&self) -> Self {
+        match self {
+            Self::InitContainer => Self::InitContainer,
+            Self::TenantContainer { exec_notify_fd } => Self::TenantContainer {
+                exec_notify_fd: exec_notify_fd.try_clone().unwrap(/* is unwrap ok? */),
+            },
+        }
+    }
 }
 
 #[derive(Clone)]
diff --git a/crates/libcontainer/src/process/container_intermediate_process.rs b/crates/libcontainer/src/process/container_intermediate_process.rs
index 0085416b..7e5ac9c0 100644
--- a/crates/libcontainer/src/process/container_intermediate_process.rs
+++ b/crates/libcontainer/src/process/container_intermediate_process.rs
@@ -1,3 +1,5 @@
+use std::os::fd::AsRawFd;
+
 use crate::error::MissingSpecError;
 use crate::{namespaces::Namespaces, process::channel, process::fork};
 use libcgroups::common::CgroupManager;
@@ -128,12 +130,12 @@ pub fn container_intermediate_process(
                     if let Err(err) = main_sender.exec_failed(e.to_string()) {
                         tracing::error!(?err, "failed sending error to main sender");
                     }
-                    if let ContainerType::TenantContainer { exec_notify_fd } = args.container_type {
+                    if let ContainerType::TenantContainer { exec_notify_fd } = &args.container_type {
                         let buf = format!("{e}");
-                        if let Err(err) = write(exec_notify_fd, buf.as_bytes()) {
+                        if let Err(err) = write(&exec_notify_fd, buf.as_bytes()) {
                             tracing::error!(?err, "failed to write to exec notify fd");
                         }
-                        if let Err(err) = close(exec_notify_fd) {
+                        if let Err(err) = close(exec_notify_fd.as_raw_fd()) {
                             tracing::error!(?err, "failed to close exec notify fd");
                         }
                     }
@@ -157,8 +159,8 @@ pub fn container_intermediate_process(
     })?;
 
     // Close the exec_notify_fd in this process
-    if let ContainerType::TenantContainer { exec_notify_fd } = args.container_type {
-        close(exec_notify_fd).map_err(|err| {
+    if let ContainerType::TenantContainer { exec_notify_fd } = &args.container_type {
+        close(exec_notify_fd.as_raw_fd()).map_err(|err| {
             tracing::error!("failed to close exec notify fd: {}", err);
             IntermediateProcessError::ExecNotify(err)
         })?;
@@ -206,7 +208,7 @@ fn setup_userns(
     prctl::set_dumpable(true).map_err(|e| {
         IntermediateProcessError::Other(format!(
             "error in setting dumpable to true : {}",
-            nix::errno::from_i32(e)
+            nix::errno::Errno::from_raw(e)
         ))
     })?;
     sender.identifier_mapping_request().map_err(|err| {
@@ -220,7 +222,7 @@ fn setup_userns(
     prctl::set_dumpable(false).map_err(|e| {
         IntermediateProcessError::Other(format!(
             "error in setting dumplable to false : {}",
-            nix::errno::from_i32(e)
+            nix::errno::Errno::from_raw(e)
         ))
     })?;
     Ok(())
diff --git a/crates/libcontainer/src/process/fork.rs b/crates/libcontainer/src/process/fork.rs
index c0e05539..60a33f46 100644
--- a/crates/libcontainer/src/process/fork.rs
+++ b/crates/libcontainer/src/process/fork.rs
@@ -1,4 +1,4 @@
-use std::{ffi::c_int, fs::File, num::NonZeroUsize};
+use std::{ffi::c_int, num::NonZeroUsize};
 
 use libc::SIGCHLD;
 use nix::{
@@ -164,15 +164,17 @@ fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, Clone
     // do not use MAP_GROWSDOWN since it is not well supported.
     // Ref: https://man7.org/linux/man-pages/man2/mmap.2.html
     let child_stack = unsafe {
-        // Since nix = "0.27.1", `mmap()` requires a generic type `F: AsFd`.
-        // `::<File>` doesn't have any meaning because we won't use it.
-        mman::mmap::<File>(
+        // Since nix = "0.27.1", `mmap()` requires a generic type `F: AsFd` and takes f: Option<F>.
+        // if f: None, then fd = -1 is used; (-1 is just an invalid fd)
+        // let fd = f.map(|f| f.as_fd().as_raw_fd()).unwrap_or(-1);
+        // Since nix = "0.28.0" mmap function accepts f: F instead of Option<F>
+        // but it is not possible to manually create OwnedFd::from_raw_fd(-1) because rust will panic due to assert
+        // So, mmap_anonymous is used which achieves the exact same behaviour
+        mman::mmap_anonymous(
             None,
             NonZeroUsize::new(default_stack_size).ok_or(CloneError::ZeroStackSize)?,
             mman::ProtFlags::PROT_READ | mman::ProtFlags::PROT_WRITE,
             mman::MapFlags::MAP_PRIVATE | mman::MapFlags::MAP_ANONYMOUS | mman::MapFlags::MAP_STACK,
-            None,
-            0,
         )
         .map_err(CloneError::StackAllocation)?
     };
@@ -187,7 +189,7 @@ fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, Clone
 
     // Since the child stack for clone grows downward, we need to pass in
     // the top of the stack address.
-    let child_stack_top = unsafe { child_stack.add(default_stack_size) };
+    let child_stack_top = unsafe { child_stack.as_ptr().add(default_stack_size) };
 
     // Combine the clone flags with exit signals.
     let combined_flags = (flags | exit_signal.unwrap_or(0)) as c_int;
diff --git a/crates/libcontainer/src/seccomp/mod.rs b/crates/libcontainer/src/seccomp/mod.rs
index 2bfce4cb..a1acb131 100644
--- a/crates/libcontainer/src/seccomp/mod.rs
+++ b/crates/libcontainer/src/seccomp/mod.rs
@@ -332,7 +332,7 @@ mod tests {
             }
 
             if let Some(errno) = ret.err() {
-                if errno != nix::errno::from_i32(expect_error) {
+                if errno != nix::errno::Errno::from_raw(expect_error) {
                     Err(TestCallbackError::Custom(format!(
                         "getcwd failed but we didn't get the expected error from seccomp profile: {}",
                         errno
diff --git a/crates/libcontainer/src/syscall/linux.rs b/crates/libcontainer/src/syscall/linux.rs
index d01c4a54..35e5430c 100644
--- a/crates/libcontainer/src/syscall/linux.rs
+++ b/crates/libcontainer/src/syscall/linux.rs
@@ -314,7 +314,7 @@ impl Syscall for LinuxSyscall {
     fn set_id(&self, uid: Uid, gid: Gid) -> Result<()> {
         prctl::set_keep_capabilities(true).map_err(|errno| {
             tracing::error!(?errno, "failed to set keep capabilities to true");
-            nix::errno::from_i32(errno)
+            nix::errno::Errno::from_raw(errno)
         })?;
         // args : real *id, effective *id, saved set *id respectively
 
@@ -350,7 +350,7 @@ impl Syscall for LinuxSyscall {
         }
         prctl::set_keep_capabilities(false).map_err(|errno| {
             tracing::error!(?errno, "failed to set keep capabilities to false");
-            nix::errno::from_i32(errno)
+            nix::errno::Errno::from_raw(errno)
         })?;
         Ok(())
     }
diff --git a/tests/contest/contest/src/tests/seccomp_notify/seccomp_agent.rs b/tests/contest/contest/src/tests/seccomp_notify/seccomp_agent.rs
index 10664b2e..60337c4d 100644
--- a/tests/contest/contest/src/tests/seccomp_notify/seccomp_agent.rs
+++ b/tests/contest/contest/src/tests/seccomp_notify/seccomp_agent.rs
@@ -1,7 +1,7 @@
 use anyhow::{bail, Context, Result};
 use libcontainer::container::ContainerProcessState;
 use nix::{
-    sys::socket::{self, UnixAddr},
+    sys::socket::{self, Backlog, UnixAddr},
     unistd,
 };
 use std::{
@@ -35,7 +35,11 @@ pub fn recv_seccomp_listener(seccomp_listener: &Path) -> SeccompAgentResult {
     socket::bind(socket.as_raw_fd(), &addr).context("failed to bind to seccomp listener socket")?;
     // Force the backlog to be 1 so in the case of an error, only one connection
     // from clients will be waiting.
-    socket::listen(&socket.as_fd(), 1).context("failed to listen on seccomp listener")?;
+    socket::listen(
+        &socket.as_fd(),
+        Backlog::new(1).unwrap( /* safe to unwrap because Backlog(1) is always valid */ ),
+    )
+    .context("failed to listen on seccomp listener")?;
     let conn = match socket::accept(socket.as_raw_fd()) {
         Ok(conn) => conn,
         Err(e) => {
diff --git a/tests/contest/runtimetest/src/tests.rs b/tests/contest/runtimetest/src/tests.rs
index dee3afc7..d1d2f2ce 100644
--- a/tests/contest/runtimetest/src/tests.rs
+++ b/tests/contest/runtimetest/src/tests.rs
@@ -34,7 +34,7 @@ pub fn validate_readonly_paths(spec: &Spec) {
     // change manual matching of i32 to e.kind() and match statement
     for path in ro_paths {
         if let std::io::Result::Err(e) = test_read_access(path) {
-            let errno = Errno::from_i32(e.raw_os_error().unwrap());
+            let errno = Errno::from_raw(e.raw_os_error().unwrap());
             // In the integration tests we test for both existing and non-existing readonly paths
             // to be specified in the spec, so we allow ENOENT here
             if errno == Errno::ENOENT {
@@ -50,7 +50,7 @@ pub fn validate_readonly_paths(spec: &Spec) {
         }
 
         if let std::io::Result::Err(e) = test_write_access(path) {
-            let errno = Errno::from_i32(e.raw_os_error().unwrap());
+            let errno = Errno::from_raw(e.raw_os_error().unwrap());
             // In the integration tests we test for both existing and non-existing readonly paths
             // being specified in the spec, so we allow ENOENT, and we expect EROFS as the paths
             // should be read-only

@YJDoc2
Copy link
Collaborator

YJDoc2 commented Mar 26, 2024

Hey @zahash , I'd suggest you wait till #2728 is merged, as that changes some of the things mentioned here, eg the mmap_anonymous. Also instead of using patch files, please either submit a PR, so link to a branch in your repo, as that would be easier for everyone to check and see.

@zahash
Copy link
Contributor

zahash commented Mar 26, 2024

@YJDoc2 oh, didn't realize there was already a pr to update nix. Fine then i will wait till it gets merged and then continue my work on the "Fully support OwnedFd" part

@zahash
Copy link
Contributor

zahash commented Mar 26, 2024

also, @utam0k can i also be assigned to this issue?

@YJDoc2
Copy link
Collaborator

YJDoc2 commented Mar 26, 2024

@oirom , do you still plan to work on this? Otherwise we'd probably re-assign this to @zahash

@anti-entropy123
Copy link
Contributor Author

@zahash, have you considered the possibility of a FD leak here?

container_main_process.rs

pub fn container_main_process(container_args: &ContainerArgs) -> Result<(Pid, bool)> {
    ...
    let cb: CloneCb = {
        let container_args = container_args.clone();
        ...
        Box::new(move || {
        ...
        })
    };
    ...
    let intermediate_pid = fork::container_clone(cb)

fork.rs

fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, CloneError> {
    ...
    extern "C" fn main(data: *mut libc::c_void) -> libc::c_int {
        unsafe { Box::from_raw(data as *mut CloneCb)() }
    }
    let ret = unsafe {
        libc::clone(
            main,
            child_stack_top,
            combined_flags,
            data as *mut libc::c_void,
        )
    };
    ...

It seems the child process doesn't have a chance to drop the OwnedFd outside of CloneCb. 🤔 (In the past, this wasn't a problem because they were all the same RawFd.)

@oirom
Copy link

oirom commented Mar 26, 2024

@oirom , do you still plan to work on this? Otherwise we'd probably re-assign this to @zahash

Now this issue is too difficult to me, so please re-assign to @zahash

@YJDoc2
Copy link
Collaborator

YJDoc2 commented Mar 26, 2024

Now this issue is too difficult to me, so please re-assign to @zahash

No worries, feel free to take a look at some other issues such as #361 if you still want to contribute!

@zahash , assigning to you 👍

@YJDoc2 YJDoc2 assigned zahash and unassigned oirom Mar 26, 2024
@zahash
Copy link
Contributor

zahash commented Mar 26, 2024

@zahash, have you considered the possibility of a FD leak here?

container_main_process.rs

pub fn container_main_process(container_args: &ContainerArgs) -> Result<(Pid, bool)> {
    ...
    let cb: CloneCb = {
        let container_args = container_args.clone();
        ...
        Box::new(move || {
        ...
        })
    };
    ...
    let intermediate_pid = fork::container_clone(cb)

fork.rs

fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, CloneError> {
    ...
    extern "C" fn main(data: *mut libc::c_void) -> libc::c_int {
        unsafe { Box::from_raw(data as *mut CloneCb)() }
    }
    let ret = unsafe {
        libc::clone(
            main,
            child_stack_top,
            combined_flags,
            data as *mut libc::c_void,
        )
    };
    ...

It seems the child process doesn't have a chance to drop the OwnedFd outside of CloneCb. 🤔 (In the past, this wasn't a problem because they were all the same RawFd.)

@anti-entropy123 what about this? in the below line?

unsafe { drop(Box::from_raw(data)) };

@anti-entropy123
Copy link
Contributor Author

I think it seems only the parent process can reach this point. Additionally, this is still dropping CloneCb rather than releasing data outside of the closure.

@zahash
Copy link
Contributor

zahash commented Mar 26, 2024

@anti-entropy123

i don't think i fully understand what you are saying. can you please explain what you mean by "data outside of the closure"

can you also please explain a bit what the clone and clone3 functions are doing? its a bit hard to understand 😅

@zahash
Copy link
Contributor

zahash commented Mar 26, 2024

@YJDoc2 is it feasible to use OwnedFd in Sender and Receiver? or is it too big of a change?

pub struct Receiver<T> {
    receiver: OwnedFd,
    phantom: PhantomData<T>,
}

pub struct Sender<T> {
    sender: OwnedFd,
    phantom: PhantomData<T>,
}

but OwnedFd cannot be cloned, especially when it comes to creating a CloneCb in container_main_process and container_intermediate_process. So, what should be the strategy here?

@anti-entropy123
Copy link
Contributor Author

anti-entropy123 commented Mar 27, 2024

@zahash Here's my understanding of execution steps, but it might not be accurate...

  1. When creating CloneCb, container_args along with its internal OwnedFd are cloned, and then moved into the closure.
  2. Subsequently, clone() is called to create a child process. The child process inherits all files opened by the parent process. (You can confirm this by logging)
fn get_current_fds_nums() -> usize {
    let proc_fd_dir = format!("/proc/{}/fd", process::id());
    fs::read_dir(proc_fd_dir).unwrap().collect::<Vec<_>>().len()
}

image
image
image
3. However, the entrypoint of the child process is set to a special main function, which simply calls the provided closure. The child process loses ownership of any variables outside of the closure. (This includes the original container_args as well.) So, it doesn't have the opportunity to execute the drop() function for those variables.
image

Even though clone3() or clone() might be executed based on the system environment, the result should be the same.

@zahash
Copy link
Contributor

zahash commented Mar 27, 2024

@anti-entropy123

i tried to avoid cloning the ContainerArgs in the CloneCb and it complains that the callback cb might outlive the &args (or atleast thats what rust thinks).

let cb: CloneCb = {
        Box::new(move || {
            // ...
        })
    };

but aren't we are waiting for the child process to finish so that it won't outlive the callback? is there a "idiomatic" way to do this so rust can understand?

    let pid = fork::container_clone_sibling(cb).map_err(|err| {
        tracing::error!("failed to fork init process: {}", err);
        IntermediateProcessError::InitProcess(err)
    })?;

    // ...

    main_sender.intermediate_ready(pid).map_err(|err| {
        tracing::error!("failed to wait on intermediate process: {}", err);
        err
    })?;

i can try adding a lifetime parameter to CloneCb

pub type CloneCb<'a> = Box<dyn FnMut() -> i32 + 'a>;

@zahash
Copy link
Contributor

zahash commented Mar 27, 2024

Also, cloning a OwnedFd only fails on wasm.

impl OwnedFd {
    /// Creates a new `OwnedFd` instance that shares the same underlying file
    /// description as the existing `OwnedFd` instance.
    #[stable(feature = "io_safety", since = "1.63.0")]
    pub fn try_clone(&self) -> crate::io::Result<Self> {
        self.as_fd().try_clone_to_owned()
    }
}

impl BorrowedFd<'_> {
    /// Creates a new `OwnedFd` instance that shares the same underlying file
    /// description as the existing `BorrowedFd` instance.
    #[cfg(not(any(target_arch = "wasm32", target_os = "hermit")))]
    #[stable(feature = "io_safety", since = "1.63.0")]
    pub fn try_clone_to_owned(&self) -> crate::io::Result<OwnedFd> {
        // We want to atomically duplicate this file descriptor and set the
        // CLOEXEC flag, and currently that's done via F_DUPFD_CLOEXEC. This
        // is a POSIX flag that was added to Linux in 2.6.24.
        #[cfg(not(target_os = "espidf"))]
        let cmd = libc::F_DUPFD_CLOEXEC;

        // For ESP-IDF, F_DUPFD is used instead, because the CLOEXEC semantics
        // will never be supported, as this is a bare metal framework with
        // no capabilities for multi-process execution. While F_DUPFD is also
        // not supported yet, it might be (currently it returns ENOSYS).
        #[cfg(target_os = "espidf")]
        let cmd = libc::F_DUPFD;

        // Avoid using file descriptors below 3 as they are used for stdio
        let fd = cvt(unsafe { libc::fcntl(self.as_raw_fd(), cmd, 3) })?;
        Ok(unsafe { OwnedFd::from_raw_fd(fd) })
    }

    /// Creates a new `OwnedFd` instance that shares the same underlying file
    /// description as the existing `BorrowedFd` instance.
    #[cfg(any(target_arch = "wasm32", target_os = "hermit"))]
    #[stable(feature = "io_safety", since = "1.63.0")]
    pub fn try_clone_to_owned(&self) -> crate::io::Result<OwnedFd> {
        Err(crate::io::const_io_error!(
            crate::io::ErrorKind::Unsupported,
            "operation not supported on WASI yet",
        ))
    }
}

@anti-entropy123
Copy link
Contributor Author

is there a "idiomatic" way to do this so rust can understand?

I think the main issue is that clone causes a "stack switch", which makes the child process lose all the local variables originally on the stack. So, at the moment, I don't have a good idea to solve this problem... (Apart from using Rc and manually adjusting the reference count.) @zahash

@zahash
Copy link
Contributor

zahash commented Mar 27, 2024

I think the main issue is that clone causes a "stack switch", which makes the child process lose all the local variables originally on the stack.

@anti-entropy123

i thought clone creates a process identical to the parent with all the local variables.
but they both will have separate memory spaces so modification done on the child won't affect parent.
besides, i don't think we are doing any modifications. (i might be wrong).

also, the CloneCb closure is boxed and we send the actual function pointer of the closure to libc. so shouldn't the closure retain all the references to the enclosed variables in the function body.

@zahash
Copy link
Contributor

zahash commented Mar 27, 2024

@anti-entropy123 i did this and all the tests pass. I basically removed #[derive(Clone)] from Senders and Receivers and added a lifetime to CloneCb

@YJDoc2 i can't create a PR right now because this hasn't merged yet #2728 and a lot of my changes depend on it. sorry for using a patch again.

diff --git a/crates/libcontainer/src/channel.rs b/crates/libcontainer/src/channel.rs
index 23d255f9..427912d6 100644
--- a/crates/libcontainer/src/channel.rs
+++ b/crates/libcontainer/src/channel.rs
@@ -18,13 +18,11 @@ pub enum ChannelError {
     #[error("channel connection broken")]
     BrokenChannel,
 }
-#[derive(Clone)]
 pub struct Receiver<T> {
     receiver: RawFd,
     phantom: PhantomData<T>,
 }
 
-#[derive(Clone)]
 pub struct Sender<T> {
     sender: RawFd,
     phantom: PhantomData<T>,
diff --git a/crates/libcontainer/src/process/channel.rs b/crates/libcontainer/src/process/channel.rs
index 9683e5ed..b1bd2e88 100644
--- a/crates/libcontainer/src/process/channel.rs
+++ b/crates/libcontainer/src/process/channel.rs
@@ -40,7 +40,6 @@ pub fn main_channel() -> Result<(MainSender, MainReceiver), ChannelError> {
     Ok((MainSender { sender }, MainReceiver { receiver }))
 }
 
-#[derive(Clone)]
 pub struct MainSender {
     sender: Sender<Message>,
 }
@@ -88,7 +87,6 @@ impl MainSender {
     }
 }
 
-#[derive(Clone)]
 pub struct MainReceiver {
     receiver: Receiver<Message>,
 }
@@ -199,7 +197,6 @@ pub fn intermediate_channel() -> Result<(IntermediateSender, IntermediateReceive
     ))
 }
 
-#[derive(Clone)]
 pub struct IntermediateSender {
     sender: Sender<Message>,
 }
@@ -219,7 +216,6 @@ impl IntermediateSender {
     }
 }
 
-#[derive(Clone)]
 pub struct IntermediateReceiver {
     receiver: Receiver<Message>,
 }
@@ -256,7 +252,6 @@ pub fn init_channel() -> Result<(InitSender, InitReceiver), ChannelError> {
     Ok((InitSender { sender }, InitReceiver { receiver }))
 }
 
-#[derive(Clone)]
 pub struct InitSender {
     sender: Sender<Message>,
 }
@@ -275,7 +270,6 @@ impl InitSender {
     }
 }
 
-#[derive(Clone)]
 pub struct InitReceiver {
     receiver: Receiver<Message>,
 }
diff --git a/crates/libcontainer/src/process/container_intermediate_process.rs b/crates/libcontainer/src/process/container_intermediate_process.rs
index 0085416b..8d9a015c 100644
--- a/crates/libcontainer/src/process/container_intermediate_process.rs
+++ b/crates/libcontainer/src/process/container_intermediate_process.rs
@@ -99,12 +99,7 @@ pub fn container_intermediate_process(
     }
 
     let cb: CloneCb = {
-        let args = args.clone();
-        let init_sender = init_sender.clone();
-        let inter_sender = inter_sender.clone();
-        let mut main_sender = main_sender.clone();
-        let mut init_receiver = init_receiver.clone();
-        Box::new(move || {
+        Box::new(|| {
             if let Err(ret) = prctl::set_name("youki:[2:INIT]") {
                 tracing::error!(?ret, "failed to set name for child process");
                 return ret;
@@ -121,7 +116,7 @@ pub fn container_intermediate_process(
                 tracing::error!(?err, "failed to close sender in the intermediate process");
                 return -1;
             }
-            match container_init_process(&args, &mut main_sender, &mut init_receiver) {
+            match container_init_process(&args, main_sender, init_receiver) {
                 Ok(_) => 0,
                 Err(e) => {
                     tracing::error!("failed to initialize container process: {e}");
diff --git a/crates/libcontainer/src/process/container_main_process.rs b/crates/libcontainer/src/process/container_main_process.rs
index f7066053..41fa96ab 100644
--- a/crates/libcontainer/src/process/container_main_process.rs
+++ b/crates/libcontainer/src/process/container_main_process.rs
@@ -42,16 +42,12 @@ pub fn container_main_process(container_args: &ContainerArgs) -> Result<(Pid, bo
     // cloned process, we have to be deligent about closing any unused channel.
     // At minimum, we have to close down any unused senders. The corresponding
     // receivers will be cleaned up once the senders are closed down.
-    let (main_sender, mut main_receiver) = channel::main_channel()?;
-    let inter_chan = channel::intermediate_channel()?;
-    let init_chan = channel::init_channel()?;
+    let (mut main_sender, mut main_receiver) = channel::main_channel()?;
+    let mut inter_chan = channel::intermediate_channel()?;
+    let mut init_chan = channel::init_channel()?;
 
     let cb: CloneCb = {
-        let container_args = container_args.clone();
-        let mut main_sender = main_sender.clone();
-        let mut inter_chan = inter_chan.clone();
-        let mut init_chan = init_chan.clone();
-        Box::new(move || {
+        Box::new(|| {
             if let Err(ret) = prctl::set_name("youki:[1:INTER]") {
                 tracing::error!(?ret, "failed to set name for child process");
                 return ret;
diff --git a/crates/libcontainer/src/process/fork.rs b/crates/libcontainer/src/process/fork.rs
index c0e05539..2a69d8d2 100644
--- a/crates/libcontainer/src/process/fork.rs
+++ b/crates/libcontainer/src/process/fork.rs
@@ -33,7 +33,7 @@ pub enum CloneError {
 /// pass in a child stack, which is empty. By storing the closure in heap, we
 /// can then in the new process to re-box the heap memory back to a closure
 /// correctly.
-pub type CloneCb = Box<dyn FnMut() -> i32>;
+pub type CloneCb<'a> = Box<dyn FnMut() -> i32 + 'a>;
 
 // Clone a sibling process that shares the same parent as the calling
 // process. This is used to launch the container init process so the parent

@zahash
Copy link
Contributor

zahash commented Mar 27, 2024

the lifetime in CloneCb<_> is only there to satisfy rust. i think the actual reason this is working is because we actually wait for the child process to finish.

    let pid = fork::container_clone_sibling(cb).map_err(|err| {
        tracing::error!("failed to fork init process: {}", err);
        IntermediateProcessError::InitProcess(err)
    })?;

    // ...

    main_sender.intermediate_ready(pid).map_err(|err| {
        tracing::error!("failed to wait on intermediate process: {}", err);
        err
    })?;

without this, the child might outlive the parent and become zombie. and rust lifetimes won't save us.

maybe there is a better way to do this so that rust understands the "waiting" part is important for the lifetime.

@anti-entropy123
Copy link
Contributor Author

anti-entropy123 commented Mar 27, 2024

OK, I agree with what you're saying, and since your code doesn't clone the container_args variable, there should be no copying of file descriptors. Therefore, fd leaks shouldn't occur. 👍

However, I'd still like to ask if you're aware of whether you're actually calling clone3() or clone()?

// An internal wrapper to manage the clone3 vs clone fallback logic.
fn clone_internal(
    mut cb: CloneCb,
    flags: u64,
    exit_signal: Option<u64>,
) -> Result<Pid, CloneError> {
    match clone3(&mut cb, flags, exit_signal) {
        ...
        // For now, we decide to only fallback on ENOSYS
        Err(CloneError::Clone(nix::Error::ENOSYS)) => {
            ...
            let pid = clone(cb, flags, exit_signal)?;

Since in clone(), a new stack is allocated for the child process,

fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, CloneError> {
    ...
    let ret = unsafe {
            libc::clone(
                main,
                child_stack_top,
                combined_flags,
                data as *mut libc::c_void,
            )
        };
    ...

wouldn't the closure's references to local variables (e.g., &container_args) become invalid? @zahash

@zahash
Copy link
Contributor

zahash commented Mar 27, 2024

@anti-entropy123

I have the exact same question 😅😅

do you know how clone3 vs clone differences are? and since the closure has references to variables in the paren'ts memory, do they also get copied to the child memory?

i can see that clone3 also takes args but clone doesn't. (but i don't think we are passing anything related in it)

but why do the tests pass? do they even check this part of the code?

@anti-entropy123
Copy link
Contributor Author

I think your test device should support clone3(), so it likely called clone3() in reality. And clone3() seems to completely copy the parent process's memory, which is why your test passed.

@zahash
Copy link
Contributor

zahash commented Mar 27, 2024

@anti-entropy123 is there a way to force it to use clone instead?

never mind. i'll just comment out the clone3 part

@anti-entropy123
Copy link
Contributor Author

But I'm not entirely sure why clone() necessitates switching to a new stack; it could be a requirement of the syscall itself (? 🤔 ).

@zahash
Copy link
Contributor

zahash commented Mar 27, 2024

@anti-entropy123 all the tests still pass

fn clone_internal(
    mut cb: CloneCb,
    flags: u64,
    exit_signal: Option<u64>,
) -> Result<Pid, CloneError> {
    clone(cb, flags, exit_signal)
}

@anti-entropy123
Copy link
Contributor Author

all the tests still pass

Amazing... I'm curious about the reason. @zahash

@zahash
Copy link
Contributor

zahash commented Mar 27, 2024

@anti-entropy123

Amazing... I'm curious about the reason

I'm equally curious. 😂

maybe this is undefined behavior.

@anti-entropy123
Copy link
Contributor Author

maybe this is undefined behavior

My expectation was that a segmentation fault would occur.

@zahash
Copy link
Contributor

zahash commented Mar 27, 2024

@anti-entropy123 if found this on the kernel man page

The clone() wrapper function
       When the child process is created with the clone() wrapper
       function, it commences execution by calling the function pointed
       to by the argument fn.  (This differs from [fork(2)](https://man7.org/linux/man-pages/man2/fork.2.html), where
       execution continues in the child from the point of the [fork(2)](https://man7.org/linux/man-pages/man2/fork.2.html)
       call.)  The arg argument is passed as the argument of the
       function fn.

       When the fn(arg) function returns, the child process terminates.
       The integer returned by fn is the exit status for the child
       process.  The child process may also terminate explicitly by
       calling [exit(2)](https://man7.org/linux/man-pages/man2/exit.2.html) or after receiving a fatal signal.

       The stack argument specifies the location of the stack used by
       the child process.  Since the child and calling process may share
       memory, it is not possible for the child process to execute in
       the same stack as the calling process.  The calling process must
       therefore set up memory space for the child stack and pass a
       pointer to this space to clone().  Stacks grow downward on all
       processors that run Linux (except the HP PA processors), so stack
       usually points to the topmost address of the memory space set up
       for the child stack.  Note that clone() does not provide a means
       whereby the caller can inform the kernel of the size of the stack
       area.

https://man7.org/linux/man-pages/man2/clone.2.html

@zahash
Copy link
Contributor

zahash commented Mar 27, 2024

what about others? what do they think? @utam0k @YJDoc2 @yihuaf

@anti-entropy123
Copy link
Contributor Author

I think I understand now. Even though clone() assigns a new stack to the child process, the original memory region (the old stack) may still be valid. So, those references remain effective.
Do you think so? @zahash

@YJDoc2
Copy link
Collaborator

YJDoc2 commented Mar 27, 2024

Hey, will need some time to read all the discussion above and consider it. Probably would get back on this around the weekend....

@zahash
Copy link
Contributor

zahash commented Mar 27, 2024

@anti-entropy123

maybe. but how can one process directly access the memory of another process? its not allowed right?

or maybe its allowed in "parent - child" processes?

or maybe its because, the child gets an exact copy of the parent's memory when clone/fork

@anti-entropy123
Copy link
Contributor Author

No, what I mean is that the old stack is also copied to the child process. The address spaces of the parent and child processes are also isolated. I'm planning to observe the maps of both processes. @zahash

@zahash
Copy link
Contributor

zahash commented Mar 27, 2024

@anti-entropy123

I'm planning to observe the maps of both processes

yes please. that would be very helpful 🥰

@anti-entropy123
Copy link
Contributor Author

@zahash It seems our guess was correct.
I made some changes to make testing easier for me:
image
When running youki create, the process prints logs:
image
Check the maps file for both the parent and child processes separately (using sudo cat /proc//maps):

  • Parent:
    sudo cat /proc/26778/maps
    ...
    5610cba8c000-5610cbacf000 rw-p 00000000 00:00 0 [heap]
    7f7116e68000-7f7116e69000 ---p 00000000 00:00 0
    7f7116e69000-7f711766b000 rw-p 00000000 00:00 0
    ...
    7ffccdd65000-7ffccdd9b000 rw-p 00000000 00:00 0 [stack]
    ...

  • Child:
    sudo cat /proc/26779/maps | grep rw
    ...
    5610cba8c000-5610cbacf000 rw-p 00000000 00:00 0 [heap]
    7f7116e69000-7f711766b000 rw-p 00000000 00:00 0
    ...
    7f7117a6b000-7f7117a6c000 rw-p 00000000 00:00 0
    7ffccdd65000-7ffccdd9b000 rw-p 00000000 00:00 0 [stack]

Both the memory region containing the local variables of the child process (0x7f71176678c4) and the region referencing the local variables of the parent process (located at 7ffccdd65000-7ffccdd9b000) appear to be valid. So now it all makes sense. 🎉

@utam0k
Copy link
Member

utam0k commented Mar 30, 2024

@anti-entropy123 @utam0k does cargo test run all the available tests?

No, you can see our test coverage here: https://app.codecov.io/gh/containers/youki

@utam0k
Copy link
Member

utam0k commented Mar 30, 2024

@anti-entropy123 @zahash
Sorry for the late coming. My understanding is a memory stack is copied from parent process's one when clone(2) is called. But be careful, it is just copy, so changes in the child process's memory don't affect the parent's one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

5 participants