
Add balance_weights to weight balanced batches #1588

Status: Draft — wants to merge 6 commits into `develop`
56 changes: 56 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.yml
@@ -0,0 +1,56 @@
name: Bug report
description: Report a bug in pyannote.audio
body:

- type: markdown
attributes:
value: |
When reporting bugs, please follow the guidelines in this template. This helps identify the problem precisely and thus enables contributors to fix it faster.
- Write a descriptive issue title above.
- The golden rule is to **always open *one* issue for *one* bug**. If you notice several bugs and want to report them, make sure to create one new issue for each of them.
- Search [open](https://github.com/pyannote/pyannote-audio/issues) and [closed](https://github.com/pyannote/pyannote-audio/issues?q=is%3Aissue+is%3Aclosed) issues to ensure it has not already been reported. If you don't find a relevant match or if you're unsure, don't hesitate to **open a new issue**. The bugsquad will handle it from there if it's a duplicate.
- Please always check if your issue is reproducible in the latest version – it may already have been fixed!
- If you use a custom build, please test if your issue is reproducible in official releases too.

- type: textarea
attributes:
label: Tested versions
description: |
To properly fix a bug, we need to identify if the bug was recently introduced in the library, or if it was always present.
- Please specify the pyannote.audio version you found the issue in, including the **Git commit hash** if using a development build.
- If you can, **please test earlier pyannote.audio versions** and, if applicable, newer versions (development branch). Mention whether the bug is reproducible or not in the versions you tested.
- The aim is for us to identify whether a bug is a **regression**, i.e. an issue that didn't exist in a previous version, but was introduced later on, breaking existing functionality. For example, if a bug is reproducible in 3.2 but not in 3.0, we would like you to test intermediate 3.1 to find which version is the first one where the issue can be reproduced.
placeholder: |
- Reproducible in: 3.1, 3.2, and later
- Not reproducible in: 3.0
validations:
required: true

- type: input
attributes:
label: System information
description: |
- Specify the OS version and, when relevant, hardware information.
- For issues that are likely OS-specific and/or GPU-related, please specify the GPU model and architecture.
- **Bug reports not including the required information may be closed at the maintainers' discretion.** If in doubt, always include all the requested information; it's better to include too much information than not enough information.
placeholder: macOS 13.6 - pyannote.audio 3.1.1 - M1 Pro
validations:
required: true

- type: textarea
attributes:
label: Issue description
description: |
Describe your issue briefly. What doesn't work, and how do you expect it to work instead?
You can include audio, images or videos with drag and drop, and format code blocks or logs with <code>```</code> tags.
validations:
required: true

- type: input
attributes:
label: Minimal reproduction example (MRE)
description: |
A reproducible issue is a prerequisite for contributors to be able to solve it.
Include a link to a minimal reproduction example using [this Google Colab notebook](https://colab.research.google.com/github/pyannote/pyannote-audio/blob/develop/tutorials/MRE_template.ipynb) as a starting point.
validations:
required: true
15 changes: 15 additions & 0 deletions .github/ISSUE_TEMPLATE/config.yml
@@ -0,0 +1,15 @@
blank_issues_enabled: false

contact_links:

- name: Feature request
url: https://github.com/pyannote/pyannote-audio/discussions
about: Suggest an idea for this project.

- name: Consulting
url: https://herve.niderb.fr/consulting
about: Using pyannote.audio in production? Make the most of it thanks to our consulting services.

- name: Premium models
url: https://forms.gle/eKhn7H2zTa68sMMx8
about: We are considering selling premium models, extensions, or services around pyannote.audio.
20 changes: 0 additions & 20 deletions .github/ISSUE_TEMPLATE/feature_request.md

This file was deleted.

29 changes: 0 additions & 29 deletions .github/workflows/new_issue.yml

This file was deleted.

2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -14,7 +14,7 @@ repos:

# Sort imports
- repo: https://github.com/PyCQA/isort
rev: 5.10.1
rev: 5.12.0
hooks:
- id: isort
args: ["--profile", "black"]
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,16 @@
# Changelog

## develop

### New features

- feat(pipeline): add `Waveform` and `SampleRate` preprocessors
- feat(model): add `num_frames` and `receptive_field` to segmentation models

### Fixes

- fix(task): fix random generators

## Version 3.1.1 (2023-12-01)

### TL;DR
28 changes: 15 additions & 13 deletions README.md
@@ -70,26 +70,28 @@ for turn, _, speaker in diarization.itertracks(yield_label=True):
- Videos
- [Introduction to speaker diarization](https://umotion.univ-lemans.fr/video/9513-speech-segmentation-and-speaker-diarization/) / JSALT 2023 summer school / 90 min
- [Speaker segmentation model](https://www.youtube.com/watch?v=wDH2rvkjymY) / Interspeech 2021 / 3 min
- [First releaase of pyannote.audio](https://www.youtube.com/watch?v=37R_R82lfwA) / ICASSP 2020 / 8 min
- [First release of pyannote.audio](https://www.youtube.com/watch?v=37R_R82lfwA) / ICASSP 2020 / 8 min

## Benchmark

Out of the box, `pyannote.audio` speaker diarization [pipeline](https://hf.co/pyannote/speaker-diarization-3.1) v3.1 is expected to be much better (and faster) than v2.x.
Those numbers are diarization error rates (in %):

| Benchmark | [v2.1](https://hf.co/pyannote/speaker-diarization-2.1) | [v3.1](https://hf.co/pyannote/speaker-diarization-3.1) | [Premium](https://forms.gle/eKhn7H2zTa68sMMx8) |
| ---------------------- | ------ | ------ | --------- |
| [AISHELL-4](https://arxiv.org/abs/2104.03603) | 14.1 | 12.2 | 11.9 |
| [AliMeeting](https://www.openslr.org/119/) (channel 1) | 27.4 | 24.4 | 22.5 |
| [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (IHM) | 18.9 | 18.8 | 16.6 |
| [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (SDM) | 27.1 | 22.4 | 20.9 |
| [AVA-AVD](https://arxiv.org/abs/2111.14448) | 66.3 | 50.0 | 39.8 |
| [CALLHOME](https://catalog.ldc.upenn.edu/LDC2001S97) ([part 2](https://github.com/BUTSpeechFIT/CALLHOME_sublists/issues/1)) | 31.6 | 28.4 | 22.2 |
| [DIHARD 3](https://catalog.ldc.upenn.edu/LDC2022S14) ([full](https://arxiv.org/abs/2012.01477)) | 26.9 | 21.7 | 17.2 |
| [Earnings21](https://github.com/revdotcom/speech-datasets) | 17.0 | 9.4 | 9.0 |
| [Ego4D](https://arxiv.org/abs/2110.07058) (dev.) | 61.5 | 51.2 | 43.8 |
| [MSDWild](https://github.com/X-LANCE/MSDWILD) | 32.8 | 25.3 | 19.8 |
| [RAMC](https://www.openslr.org/123/) | 22.5 | 22.2 | 18.4 |
| [REPERE](https://www.islrn.org/resources/360-758-359-485-0/) (phase2) | 8.2 | 7.8 | 7.6 |
| [VoxConverse](https://github.com/joonson/voxconverse) (v0.3) | 11.2 | 11.3 | 9.4 |

[Diarization error rate](http://pyannote.github.io/pyannote-metrics/reference.html#diarization) (in %)
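For reference, the diarization error rate reported above is the sum of false alarm, missed detection, and speaker confusion durations, divided by the total duration of reference speech. A minimal sketch of that arithmetic — the durations below are made up for illustration, not taken from any benchmark:

```python
# Hypothetical error components for a 20-second reference, in seconds
total_reference = 20.0  # total duration of reference speech
false_alarm = 0.5       # hypothesis speech where the reference has none
missed_detection = 1.0  # reference speech absent from the hypothesis
confusion = 0.5         # speech attributed to the wrong speaker

der = (false_alarm + missed_detection + confusion) / total_reference
print(f"DER = {100 * der:.1f}%")  # reported as a percentage, as in the table above
```

Note that DER can exceed 100% when false alarms are large, since the numerator is not bounded by the reference duration.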

89 changes: 87 additions & 2 deletions pyannote/audio/models/blocks/sincnet.py
@@ -28,17 +28,21 @@
import torch.nn as nn
import torch.nn.functional as F
from asteroid_filterbanks import Encoder, ParamSincFB
from pyannote.core import SlidingWindow

from pyannote.audio.utils.frame import conv1d_num_frames, conv1d_receptive_field_size


class SincNet(nn.Module):
def __init__(self, sample_rate: int = 16000, stride: int = 1):
super().__init__()

if sample_rate != 16000:
raise NotImplementedError("PyanNet only supports 16kHz audio for now.")
raise NotImplementedError("SincNet only supports 16kHz audio for now.")
# TODO: add support for other sample rates. It should be enough to multiply
# kernel_size by (sample_rate / 16000), but this needs to be double-checked.

self.sample_rate = sample_rate
self.stride = stride

self.wav_norm1d = nn.InstanceNorm1d(1, affine=True)
@@ -70,6 +74,88 @@ def __init__(self, sample_rate: int = 16000, stride: int = 1):
self.pool1d.append(nn.MaxPool1d(3, stride=3, padding=0, dilation=1))
self.norm1d.append(nn.InstanceNorm1d(60, affine=True))

def num_frames(self, num_samples: int) -> int:
"""Compute number of output frames for a given number of input samples

Parameters
----------
num_samples : int
Number of input samples

Returns
-------
num_frames : int
Number of output frames
"""

kernel_size = [251, 3, 5, 3, 5, 3]
stride = [self.stride, 3, 1, 3, 1, 3]
padding = [0, 0, 0, 0, 0, 0]
dilation = [1, 1, 1, 1, 1, 1]

num_frames = num_samples
for k, s, p, d in zip(kernel_size, stride, padding, dilation):
num_frames = conv1d_num_frames(
num_frames, kernel_size=k, stride=s, padding=p, dilation=d
)

return num_frames

def receptive_field_size(self, num_frames: int = 1) -> int:
"""Compute receptive field size

Parameters
----------
num_frames : int, optional
Number of frames in the output signal

Returns
-------
receptive_field_size : int
Receptive field size
"""

kernel_size = [251, 3, 5, 3, 5, 3]
stride = [self.stride, 3, 1, 3, 1, 3]
padding = [0, 0, 0, 0, 0, 0]
dilation = [1, 1, 1, 1, 1, 1]

receptive_field_size = num_frames
for k, s, p, d in reversed(list(zip(kernel_size, stride, padding, dilation))):
receptive_field_size = conv1d_receptive_field_size(
num_frames=receptive_field_size,
kernel_size=k,
stride=s,
padding=p,
dilation=d,
)

return receptive_field_size

def receptive_field(self) -> SlidingWindow:
"""Compute receptive field

Returns
-------
receptive field : SlidingWindow

Source
------
https://distill.pub/2019/computing-receptive-fields/

"""

# duration of the receptive field of each output frame
duration = self.receptive_field_size() / self.sample_rate

# step between the receptive field region of two consecutive output frames
step = (
self.receptive_field_size(num_frames=2)
- self.receptive_field_size(num_frames=1)
) / self.sample_rate

return SlidingWindow(start=0.0, duration=duration, step=step)

def forward(self, waveforms: torch.Tensor) -> torch.Tensor:
"""Pass forward

@@ -83,7 +169,6 @@ def forward(self, waveforms: torch.Tensor) -> torch.Tensor:
for c, (conv1d, pool1d, norm1d) in enumerate(
zip(self.conv1d, self.pool1d, self.norm1d)
):

outputs = conv1d(outputs)

# https://github.com/mravanelli/SincNet/issues/4
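The `num_frames` and `receptive_field_size` methods added to `SincNet` chain standard 1-d convolution arithmetic through its six conv/pool layers. A self-contained sketch of that arithmetic, using the layer hyper-parameters listed in the diff and the default `stride=1` (`conv1d_num_frames` and `conv1d_receptive_field_size` are reimplemented here from the standard formulas, not imported from pyannote.audio):

```python
from math import floor

# Layer hyper-parameters as listed in SincNet.num_frames, with the
# default constructor stride=1 for the first (sinc) convolution.
KERNEL_SIZE = [251, 3, 5, 3, 5, 3]
STRIDE = [1, 3, 1, 3, 1, 3]
PADDING = [0] * 6
DILATION = [1] * 6

def conv1d_num_frames(n, k, s, p, d):
    # Standard Conv1d output-length formula
    return floor((n + 2 * p - d * (k - 1) - 1) / s) + 1

def conv1d_receptive_field_size(n_out, k, s, p, d):
    # Inverse mapping: input samples covered by n_out output frames
    return (n_out - 1) * s - 2 * p + d * (k - 1) + 1

def num_frames(num_samples):
    # Apply the output-length formula layer by layer, first to last
    n = num_samples
    for k, s, p, d in zip(KERNEL_SIZE, STRIDE, PADDING, DILATION):
        n = conv1d_num_frames(n, k, s, p, d)
    return n

def receptive_field_size(n_frames=1):
    # Walk the layers in reverse to map output frames back to input samples
    r = n_frames
    for k, s, p, d in reversed(list(zip(KERNEL_SIZE, STRIDE, PADDING, DILATION))):
        r = conv1d_receptive_field_size(r, k, s, p, d)
    return r

print(num_frames(16000))        # output frames for 1 s of 16 kHz audio
print(receptive_field_size(1))  # input samples seen by a single output frame
```

With a different `stride` for the first layer, only the first entry of `STRIDE` changes; the rest of the chain is identical.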
31 changes: 31 additions & 0 deletions pyannote/audio/models/segmentation/PyanNet.py
@@ -27,6 +27,7 @@
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange
from pyannote.core import SlidingWindow
from pyannote.core.utils.generators import pairwise

from pyannote.audio.core.model import Model
@@ -157,6 +158,36 @@ def build(self):
self.classifier = nn.Linear(in_features, out_features)
self.activation = self.default_activation()

def num_frames(self, num_samples: int) -> int:
"""Compute number of output frames for a given number of input samples

Parameters
----------
num_samples : int
Number of input samples

Returns
-------
num_frames : int
Number of output frames
"""

return self.sincnet.num_frames(num_samples)

def receptive_field(self) -> SlidingWindow:
"""Compute receptive field

Returns
-------
receptive field : SlidingWindow

Source
------
https://distill.pub/2019/computing-receptive-fields/

"""
return self.sincnet.receptive_field()

def forward(self, waveforms: torch.Tensor) -> torch.Tensor:
"""Pass forward

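The `SlidingWindow` that `receptive_field()` returns converts receptive-field sizes from samples into seconds: its `duration` comes from the size for one output frame, and its `step` from the growth between one and two frames. A minimal sketch of that conversion — the sample counts below are illustrative values for a `stride=1` configuration, not guaranteed to match every model:

```python
sample_rate = 16000

# Illustrative receptive-field sizes, in samples (hypothetical stride=1 values)
rf_one_frame = 325   # input samples covered by one output frame
rf_two_frames = 352  # input samples covered by two consecutive output frames

duration = rf_one_frame / sample_rate                # window duration, seconds
step = (rf_two_frames - rf_one_frame) / sample_rate  # hop between windows, seconds

# Time span of the receptive field of the i-th output frame,
# mirroring SlidingWindow(start=0.0, duration=duration, step=step)
def frame_span(i):
    start = i * step
    return start, start + duration

print(f"duration={duration * 1000:.2f} ms, step={step * 1000:.3f} ms")
```

This is why `PyanNet` can simply delegate both `num_frames` and `receptive_field` to its `SincNet` block: the LSTM and linear layers that follow preserve the frame rate, so the frame-to-time mapping is set entirely by the convolutional front-end.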