
[Question] #9

Open

Truong-Thanh-Quang opened this issue Sep 9, 2020 · 7 comments

Comments

@Truong-Thanh-Quang

I have a small code like this:
(code shown in an attached screenshot)

What is the best replacement for audio[0].data.cpu() ?

Thanks!

@Santosh-Gupta
Owner

Santosh-Gupta commented Sep 9, 2020

Hmm, I'm not sure what's going on in that code; it doesn't seem to be related to SpeedTorch.

@Truong-Thanh-Quang
Author

Yes, of course.
I want to integrate SpeedTorch into the WaveGlow inference pipeline and need help finding the best way to do it.

Thanks anyway.

@kwanUm

kwanUm commented Sep 27, 2020

Joining the question..

The project seems to give really nice performance gains, and I'm trying to use it in an inference pipeline I've created. I don't quite understand the API, or whether what I'm aiming for even makes sense.

Basically I have data coming in as a numpy array and a PyTorch model that performs a conversion on this data (it's audio data). The code looks as follows:

def process_audio(audio: np.ndarray):
    audio_tensor = torch.from_numpy(audio).cuda()
    processed = model(audio_tensor)
    processed_numpy = processed.cpu().numpy()
    return processed_numpy

(The above method is called many times in a row and runs in real-time)

I was looking to reduce the runtime of the method by improving the CPU-->GPU and GPU-->CPU data transfer times. Basically I wanted the following lines to take less time:

  • audio_tensor = torch.from_numpy(audio).cuda()
  • processed_numpy = processed.cpu().numpy()

Is it possible to do it with SpeedTorch?

Thanks in advance for the help!

@Santosh-Gupta
Owner

Santosh-Gupta commented Sep 28, 2020


Yeah, if you're working on a CPU with a lower number of cores, SpeedTorch may help; but PyTorch may have updated their indexing kernels by now, so benchmarking is encouraged to see whether SpeedTorch is actually helping.

Is it possible to have all possible audio in a single matrix? If so, this would be a good application for SpeedTorch. But if not, then there would be an additional step in converting the data to a 'SpeedTorch variable'. In that case, it might be better to do something called 'pinning' the audio data on the CPU (https://discuss.pytorch.org/t/when-to-set-pin-memory-to-true/19723), which speeds up data transfer from CPU to CUDA.
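Roughly, pinning would look something like this in plain PyTorch (just a sketch, reusing `model` from your snippet; dtype and shape handling is left out):

import numpy as np
import torch

def process_audio(audio: np.ndarray) -> np.ndarray:
    # pin_memory() page-locks the CPU tensor, which makes the
    # host-to-device copy faster and lets it run asynchronously.
    audio_tensor = torch.from_numpy(audio).pin_memory()
    audio_gpu = audio_tensor.to('cuda', non_blocking=True)
    processed = model(audio_gpu)
    # .cpu() synchronizes, so the result is ready when .numpy() is called.
    return processed.cpu().numpy()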

If you can give me more details about what 'audio' and 'processed_numpy' are, I may be able to give some more pointers.

@kwanUm

kwanUm commented Sep 29, 2020

Thank you for the reply.

In my case, unlike in training, I'm receiving and processing the incoming audio in real time. I'm processing about 1000 samples of audio data as they come in, and to keep latency at a minimum I load those 1000 samples to the GPU right after they arrive. Since the number of possible audio values is huge, a single matrix storing all of them is not feasible (although it's a very creative idea!).

To answer your question - 'audio' is a numpy array of data that is received from the microphone. I transform it to a tensor, load it to the GPU, and move the processed audio result back to numpy --> that's the 'processed_numpy' variable. Note that len(audio) == len(processed_numpy) at the end of the function.

Re: pinned memory. I've tried using pin_memory like the following:

def process_audio(audio: np.ndarray):
    audio_tensor = torch.from_numpy(audio).pin_memory().cuda()
    processed = model(audio_tensor)
    processed_numpy = processed.cpu().numpy()
    return processed_numpy

But my real-time benchmark metrics showed that latency slightly increased in that case.
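If the extra latency comes from pin_memory() allocating and page-locking a fresh buffer on every call, maybe pinning a staging buffer once and reusing it would behave differently; a rough sketch of that (not benchmarked, and 1000 is just my sample count from above):

import numpy as np
import torch

# Page-lock a staging buffer once, up front, so the pinning cost
# is not paid on every call (1000 = samples per call).
staging = torch.empty(1000, dtype=torch.float32).pin_memory()

def process_audio(audio: np.ndarray) -> np.ndarray:
    n = audio.shape[0]
    # copy_ casts the incoming numpy data into the pinned buffer ...
    staging[:n].copy_(torch.from_numpy(audio))
    # ... and the transfer from pinned memory can then be asynchronous.
    audio_tensor = staging[:n].to('cuda', non_blocking=True)
    processed = model(audio_tensor)
    return processed.cpu().numpy()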
I hope that explains everything.

@Santosh-Gupta
Owner

Maybe you can open the data in the SpeedTorch data gadget and use that to send it to the GPU. But opening the data in SpeedTorch may take a bit longer, which would defeat the purpose. I haven't benchmarked that part.

@palomacaste

palomacaste commented Feb 23, 2023

Hello!

I have a very similar question to theirs, but it isn't clear to me whether it was resolved.
I have a trained UNet model that I want to use live, but I'm hitting a massive bottleneck when translating my tensor data to numpy because of the .cpu() step. I send one image (a 2D matrix) at a time.

Here is my code:

inImg = torch.from_numpy(np.array(inImg))
inImg = inImg.float()
inImg = Variable(inImg.to(self.device))

with torch.no_grad():
    outImg = self.generator(inImg)

return np.squeeze(outImg.detach().cpu().numpy())
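One variation I've been considering, but haven't measured, is reusing a pinned CPU tensor for the output instead of calling .cpu() each time; a rough sketch, where the Translator class name, output shape, and constructor are just illustrative:

import numpy as np
import torch

class Translator:
    def __init__(self, generator, device, out_shape=(1, 1, 256, 256)):
        self.generator = generator
        self.device = device
        # Pinned CPU tensor allocated once and reused; out_shape is a
        # placeholder and has to match the generator's output shape.
        self.out_pinned = torch.empty(out_shape, dtype=torch.float32).pin_memory()

    def translate(self, inImg: np.ndarray) -> np.ndarray:
        inImg_t = torch.from_numpy(np.array(inImg)).float().to(self.device)
        with torch.no_grad():
            outImg = self.generator(inImg_t)
        # Copying device-to-host directly into pinned memory skips the
        # pageable staging copy that .cpu() would otherwise go through;
        # copy_ blocks until the transfer finishes, so .numpy() is safe.
        self.out_pinned.copy_(outImg)
        return np.squeeze(self.out_pinned.numpy())

Though if most of the time is really spent waiting for the GPU to finish the forward pass, I suspect this wouldn't change much.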

Do you think SpeedTorch could help accelerate this? And if so, could you give me an example of how to implement it for this purpose?

Thanks!
