
Send/Sync for Device #888

Open
blogle opened this issue Nov 10, 2023 · 1 comment

Comments


blogle commented Nov 10, 2023

Currently it is not possible to move devices or tensors across threads.

pub struct Cuda {
    pub(crate) cpu: Cpu,
    pub(crate) dev: Arc<CudaDevice>,
    pub(crate) blas: Arc<CudaBlas>,
    #[cfg(feature = "cudnn")]
    #[allow(unused)]
    pub(crate) cudnn: Arc<cudarc::cudnn::Cudnn>,
    /// A second stream for kernels to optionally execute on.
    pub(crate) par_stream: Arc<CudaStream>,
    pub(crate) workspace: Arc<Mutex<CudaSlice<u8>>>,
    pub(crate) cache: Arc<TensorCache<CUdeviceptr>>,
}

Despite everything in the device being wrapped in an Arc, the Cuda device is not actually Send or Sync, almost certainly because raw *mut pointers are nested somewhere inside for FFI. As a result it is rather burdensome to implement various functionality. For example, if you want to build a pipeline where tensors are prepared on thread A while inference runs on thread B, tensors can't be moved across threads without copying to the host and back (serializing to a Vec), and device methods can't be called from both threads.
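For context, Rust's auto traits are structural: a raw pointer anywhere inside a type opts the whole type out of Send/Sync, and wrapping it in Arc does not add them back. A minimal self-contained sketch (not the actual dfdx/cudarc types) showing the effect:

use std::sync::Arc;

// Stand-in for an FFI binding that stores a raw handle.
struct FfiHandle {
    ptr: *mut std::ffi::c_void, // raw pointer => FfiHandle is !Send and !Sync
}

// Arc<T> is only Send/Sync when T: Send + Sync, so the wrapper doesn't help.
struct Device {
    handle: Arc<FfiHandle>,
}

fn assert_send<T: Send>() {}
fn assert_sync<T: Sync>() {}

fn main() {
    // Uncommenting either line fails to compile with
    // "`*mut c_void` cannot be sent/shared between threads safely":
    // assert_send::<Device>();
    // assert_sync::<Device>();
    let _ = (assert_send::<()>, assert_sync::<()>);
}

Making such a type Send/Sync requires an explicit `unsafe impl`, which is only sound if the underlying library actually permits cross-thread use of the handle.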

I am not familiar enough with the underlying implementations to know whether the device can be made Sync, but it would be great if, at a minimum, it could be sent across threads. Right now I could certainly create a device per thread (see the sketch below), but I am not sure how much overhead that carries, or what the implications are of not sharing the caches across instances.
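For illustration only, the device-per-thread workaround I have in mind would look roughly like this. It assumes the dfdx prelude (AutoDevice, tensor_from_vec, as_vec); the model and its forward call are elided since they depend on the architecture, and only plain host buffers cross the channel, which sidesteps Send/Sync but says nothing about the cache-sharing question.

use std::sync::mpsc;
use std::thread;

use dfdx::prelude::*;

fn spawn_worker(rx: mpsc::Receiver<Vec<f32>>, tx: mpsc::Sender<Vec<f32>>) {
    thread::spawn(move || {
        // Each worker owns its own device, so nothing !Send ever crosses a thread.
        let dev = AutoDevice::default();
        for host_buf in rx {
            let n = host_buf.len();
            // Only plain host Vec<f32> buffers cross the channel boundary.
            let t = dev.tensor_from_vec(host_buf, (n,));
            // ... a per-thread copy of the model would run forward(t) here ...
            tx.send(t.as_vec()).unwrap();
        }
    });
}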


blogle commented Nov 10, 2023

I am realizing that for the example I described, you wouldn't actually get any parallelism from the extra threads, since the copy (assuming it's synchronous) and the inference kernels would be interleaved on the same CUDA stream.

Nevertheless, it would be great if something like the following pseudocode were possible:

fn inference_server(results: Sender<Tensor>) -> Sender<Request> {
    let dev = dfdx::AutoDevice::default();
    let model = dev.build_module::<ResNet, f32>();
    let (tx, rx) = tokio::sync::mpsc::channel(256);

    // Stream requests in, batch them, and run the forward pass off the async runtime.
    let inferencer = tokio_stream::wrappers::ReceiverStream::new(rx)
        .map(|data| preprocess(data))
        .ready_chunks(32)
        .map(|batch_vec| dev.tensor(batch_vec))
        .for_each(|tensor| async {
            tokio::task::spawn_blocking(|| results.send(model.forward(tensor)));
        });

    tokio::spawn(inferencer);

    tx
}
