
Diarization pipeline v3.1 is much slower than 3.0 when running on CPU #1621

a-rogalska opened this issue Jan 17, 2024 · 20 comments

@a-rogalska

a-rogalska commented Jan 17, 2024

Tested versions

Tested on 3.1 vs 3.0

System information

Debian GNU/Linux, torch 2.1.2

Issue description

When running the diarization pipeline on CPU, v3.1 is more than 2x slower than v3.0. Is it possible to make it faster?

Minimal reproduction example (MRE)

from pyannote.audio import Pipeline
import time

hf_token = "..."  # your Hugging Face access token

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.0", use_auth_token=hf_token)
start = time.perf_counter()
diarization = pipeline("sample.wav")
print("\nDiarization on v3.0 took {:.2f} s\n".format(time.perf_counter() - start))

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=hf_token)
start = time.perf_counter()
diarization = pipeline("sample.wav")
print("\nDiarization on v3.1 took {:.2f} s\n".format(time.perf_counter() - start))
@ljnicol

ljnicol commented Jan 18, 2024

I'm also having this issue. Using the above code I get:

Diarization on v3.0 took 12.75 s

Diarization on v3.1 took 22.36 s

System Information

This is on a MacBook Pro (M1), running on the CPU, with torch 2.1.2.

@hbredin
Member

hbredin commented Jan 18, 2024

Would you mind sharing a Google Colab that I can just click and run?

@a-rogalska
Author

Here is a Colab link.

For a 2-minute audio file, it took me 115.84 s on v3.0.1 and 559.31 s on v3.1.0.

@hbredin
Member

hbredin commented Jan 19, 2024

Thanks for taking the time to prepare a notebook. That helps.

  1. Looks like you did not provide the sample audio file, so one cannot reproduce the example. You could share it (or another one) online and use !wget url_to_that_file.wav directly in the notebook to make it self-contained.

  2. To get a better idea of where time is spent, you can wrap the call to the pipeline with a progress hook, a timing hook, or both:

from pyannote.audio.pipelines.utils.hook import Hooks, ProgressHook, TimingHook
file = {"audio": ...}

# progress hook alone (will show progress bar)
with ProgressHook() as hook:
    diarization = pipeline(file, hook=hook)

# timing hook alone (will add a "timing" key in file)
with TimingHook() as hook:
    diarization = pipeline(file, hook=hook)

# both
with Hooks(ProgressHook(), TimingHook()) as hook:
    diarization = pipeline(file, hook=hook)
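
As a side note, once TimingHook has run, the collected timings can be read back from the file dictionary. A minimal sketch, assuming the "timing" key holds a step-name to seconds mapping (my assumption, adapt to whatever your version actually stores there):

# continues the example above: file has been processed with a TimingHook
for step, seconds in file["timing"].items():  # assumed step -> seconds mapping
    print(f"{step}: {seconds:.2f} s")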

@a-rogalska
Author

Thanks for the hint. I updated the notebook with the sample audio from the tutorials and the hooks. According to them, the embedding step takes much longer in the new version.

@hbredin
Member

hbredin commented Jan 23, 2024

Thanks to the completed MRE, I can now reproduce the issue.

The main difference between 3.0 and 3.1 is the switch from ONNX to pytorch inference.

On GPU: pytorch is faster than ONNX.
On CPU: ONNX is faster than pytorch.

Could anyone using pyannote in production on CPU chime in?
Any idea on how to make pytorch CPU inference faster?
I'd like to avoid going back to ONNX as it was apparently painful for GPU users.

@askiefer

Any update on this, @hbredin? 🙏

@hbredin
Member

hbredin commented Jan 25, 2024

No update... hence the help wanted tag ;-)
Hopefully one of the many users of pyannote will chime in.

@mengjie-du

The 3.1 pipeline's efficiency suffers from speaker embedding inference: with the default config, every 10 s chunk has to go through the embedding model three times. Separating the embedding model into its ResNet backbone and the masked pooling head proves effective: with this modification, each chunk needs only one pass through the backbone, bringing an almost 3x speedup in my experiment. Furthermore, a cached-inference strategy helps a lot as well, given the default 90% chunk overlap.
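
To illustrate the call structure (a minimal, hypothetical sketch, not pyannote's actual modules: backbone below is a placeholder trunk, and masked_stats_pooling is an assumed pooling function):

import torch
import torch.nn as nn

# placeholder standing in for the ResNet trunk: waveform -> frame-level features
backbone = nn.Sequential(nn.Conv1d(1, 64, kernel_size=5, stride=4), nn.ReLU())

def masked_stats_pooling(features, mask):
    # weighted mean + std over time: one embedding per (chunk, speaker) mask
    # features: (batch, channels, frames), mask: (batch, frames) in [0, 1]
    w = mask / (mask.sum(dim=1, keepdim=True) + 1e-8)
    mean = (features * w.unsqueeze(1)).sum(dim=2)
    var = (features.pow(2) * w.unsqueeze(1)).sum(dim=2) - mean.pow(2)
    return torch.cat([mean, var.clamp(min=1e-8).sqrt()], dim=1)

chunks = torch.randn(8, 1, 160_000)    # eight 10 s chunks at 16 kHz
masks = torch.rand(8, 3, 39_999)       # up to 3 speaker masks per chunk, at feature resolution

features = backbone(chunks)            # expensive trunk runs once per chunk
embeddings = torch.stack(              # cheap pooling repeated per speaker mask
    [masked_stats_pooling(features, masks[:, s]) for s in range(3)], dim=1
)

The point is only the shape of the computation: one backbone pass plus several cheap pooling passes, instead of three full forward passes per chunk.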

@marrrcin

I think the main problem lies in this call:

waveform, _ = self._audio.crop(
    file,
    chunk,
    duration=duration,
    mode="pad",
)

It seems like for longer files, the .crop call takes much longer than embedding the chunk (no matter whether it's CPU or CUDA). The easiest way to reproduce it is to use version 3.1.1 with a wav file that is ~1 h long. It's basically unusable for long audio files.

@hbredin
Member

hbredin commented Feb 14, 2024

@marrrcin these are two different problems.
Your problem can be solved by loading the file in memory first.
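
In code, the in-memory variant looks something like this (the waveform/sample_rate mapping is pyannote.audio's documented input format; torchaudio is just one way to decode the file, and long_audio.wav and hf_token are placeholders, as in the MRE above):

import torchaudio
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=hf_token)

# decode the whole file once up front, instead of letting the pipeline crop from disk repeatedly
waveform, sample_rate = torchaudio.load("long_audio.wav")
diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})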

@marrrcin

Thanks @hbredin, loading into memory really helped: with that, the performance is tolerable and a 1 h file finishes within a few minutes (<5 min on GPU).

@hbredin
Member

hbredin commented Feb 20, 2024

Happy that your problem is solved and that you "tolerate" the performance of pyannote (that you use for free, by the way).

@kenplusplus

kenplusplus commented Apr 7, 2024

Thanks for sharing, @marrrcin. Have you tested on CPU?

@marrrcin

marrrcin commented Apr 7, 2024

No, I was running it on a GPU.

@kenplusplus

I have tested the diarization pipeline v3.0 on CPU, and also found its latency lower than v3.1's (50 s -> 30 s).

@JuergenFleiss

Just to chime in with a CPU comparison between 3.0 and 3.1, without loading the file into memory.

The difference is massive for longer files. For a 22-minute file on a Ryzen 6850U:

  • 27 minutes for the embeddings in 3.1
  • 2 minutes 40 seconds for the embeddings in 3.0

We observed similar long embedding times on M1 and Intel.

@JuergenFleiss

@hbredin Just tried out pyannote 3.2 and embeddings are much faster again on CPU. Did you change something in this regard?

Again, a 22-minute file on a Ryzen 6850U:

  • 1 minute 48 seconds in 3.2
  • 27 minutes for the embeddings in 3.1
  • 2 minutes 40 seconds for the embeddings in 3.0

@hbredin
Member

hbredin commented May 8, 2024

I did not. But happy that the problem is solved.

@JuergenFleiss

Maybe it was the torch update...
