
Diarization Pipeline config on diarize.py #773

Open
alejandrogranizo opened this issue Apr 9, 2024 · 7 comments

Comments

@alejandrogranizo

Hi, I'm opening this issue since we are working from a place with connection restrictions. HuggingFace downloads fall under these restrictions, so the configuration of the DiarizationPipeline class has become a problem when trying to use the library's diarization feature.

We are trying to run the following code in our project:

self.diarize_model = whisperx.DiarizationPipeline(model_name='pyannote/speaker-diarization-3.1', use_auth_token='OUR_VALID_TOKEN', device='cuda')

This is the recommended way to create the diarization pipeline before running diarization.
The issue comes from our network restrictions. The DiarizationPipeline class calls pyannote.audio's Pipeline.from_pretrained with (model_name, auth_token) and gives no way to check local resources first (such as providing the local path of the config.yml or config.yaml file as an argument). Because of how DiarizationPipeline treats the argument, the model name is never detected as a yml/yaml file, so instantiation of the pyannote.audio Pipeline always reaches the hf_hub_download call. Since that request does not go through because of the network restrictions present in many territories, execution stalls until multiple request timeouts occur, and only then falls back to searching for the local model that has been sitting on the filesystem since the beginning of the execution.

Is there any solution available? When using this in a bigger project, it is very annoying to have to wait through the multiple timeouts just to test and debug.
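The "local first" behavior the issue asks for can be sketched outside the library. This is a hypothetical helper, not part of the whisperx API: if a local config.yaml exists, return its path so pyannote loads it directly and never reaches hf_hub_download; otherwise fall back to the Hub model name.

```python
from pathlib import Path


def resolve_diarization_model(local_dir: str,
                              hub_name: str = "pyannote/speaker-diarization-3.1") -> str:
    """Prefer a local config.yaml over a HuggingFace Hub model name.

    Returns the path to a local config.yaml if one exists in local_dir,
    otherwise the Hub model name, so the caller only hits the network
    as a last resort.
    """
    local_config = Path(local_dir) / "config.yaml"
    if local_config.is_file():
        # pyannote treats yml/yaml paths as local configs, skipping the Hub.
        return str(local_config)
    return hub_name
```

The returned string can then be passed as `model_name` to `whisperx.DiarizationPipeline`, so a machine behind the firewall never waits for Hub timeouts when the config is already on disk.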

Thanks

@Hyprnx

Hyprnx commented Apr 11, 2024

Hi, I'm having the same problem with network restrictions. If there is any solution, I would be interested to know. Thank you in advance.

@GroovyDan

It is possible to download all the required models and reference them from the local file system. This article from AWS describes downloading all of the models to a local file system, which is similar to the approach I took. I built a Docker image that loads all the models from AWS S3 into the container during the build and then references them via their local paths when running whisperx. Specifically for diarization, config.yaml is updated to reference the local paths of the models downloaded from HuggingFace:

print(">> Loading Diarization Pipeline")
diarize_model = whisperx.DiarizationPipeline(
    model_name=os.path.join(MODEL_DIR, DIARIZATION_FOLDER, "config.yaml"),
    device=DEVICE,
)

@Hyprnx

Hyprnx commented Apr 14, 2024

pipeline:
  name: pyannote.audio.pipelines.SpeakerDiarization
  params:
    clustering: AgglomerativeClustering
    embedding: pytorch_model_embedding.bin
    embedding_batch_size: 32
    embedding_exclude_overlap: true
    segmentation: pytorch_model_segmentation.bin
    segmentation_batch_size: 32

params:
  clustering:
    method: centroid
    min_cluster_size: 15
    threshold: 0.7153814381597874
  segmentation:
    min_duration_off: 0.5817029604921046
    threshold: 0.4442333667381752

device: cuda

@alejandrogranizo the above is my config.yaml. You can go to HuggingFace, sign the agreement with pyannote, download the respective model .bin files, and change the paths to point at them. Implemented and worked on my machine, both in the cloud and on-prem. DM me if you have any problems.
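To make sure a config like the one above never triggers a Hub fallback, it helps to verify that every referenced .bin file exists on disk before constructing the pipeline. The sketch below uses a naive stdlib-only line scan (an assumption to avoid a PyYAML dependency; a real YAML parser would be more robust):

```python
from pathlib import Path


def check_local_config(config_path: str) -> list[str]:
    """Return the .bin paths referenced in config.yaml that are missing on disk.

    Relative paths are resolved against the directory containing the config,
    matching the layout described above. An empty list means all model files
    were found locally.
    """
    base = Path(config_path).parent
    missing = []
    for line in Path(config_path).read_text().splitlines():
        # e.g. "    embedding: pytorch_model_embedding.bin" -> the value part
        value = line.split(":", 1)[-1].strip()
        if value.endswith(".bin"):
            p = Path(value)
            if not p.is_absolute():
                p = base / p
            if not p.is_file():
                missing.append(str(p))
    return missing
```

Running this once at startup and failing loudly on a non-empty result is much faster feedback than waiting for the Hub request timeouts described in the issue.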

@Hyprnx

Hyprnx commented Apr 14, 2024

diarization_pipeline = whisperx.DiarizationPipeline(<config.yaml>)

to initialize, and then

result = diarization_pipeline(your_audiofile_path)

This should work.

@Dmitriuso

@Hyprnx which version of pyannote.audio are you using? I got an error with pyannote.audio==3.1.1: "threshold parameter doesn't exist".

@Hyprnx

Hyprnx commented Apr 21, 2024

@Dmitriuso the pyannote.audio I use comes with WhisperX when I install it; I didn't install it separately.

pip install git+https://github.com/m-bain/whisperX.git@78dcfaab51005aa703ee21375f81ed31bc248560

This should work.

@nkilm

nkilm commented May 26, 2024

If you are looking for a way to run whisperx completely offline, I have a script for that:

Repo - https://github.com/nkilm/offline-whisperx

You have to manually download the models and then specify the paths in the script. The script works 100% locally without internet.
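Independently of any script, huggingface_hub honors the documented HF_HUB_OFFLINE environment variable; setting it before importing whisperx makes any accidental Hub lookup fail immediately instead of waiting through the timeouts described above. A minimal sketch (the cache path is an assumed example, not a required location):

```python
import os

# Force huggingface_hub into offline mode: any call that would hit the
# network raises immediately instead of retrying until timeout.
os.environ["HF_HUB_OFFLINE"] = "1"

# Optionally point the HF cache at a directory pre-populated with the
# models (assumed example path).
os.environ["HF_HOME"] = "/models/hf-cache"

# Import whisperx only after the environment is configured, so the
# settings take effect for every model load that follows.
```

Combined with local config.yaml paths as shown earlier in the thread, this gives fast, deterministic failures on restricted networks.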
