
Add balance_weights to weight balanced batches #1588

Status: Draft — wants to merge 6 commits into `develop`
56 changes: 56 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.yml
@@ -0,0 +1,56 @@
name: Bug report
description: Report a bug in pyannote.audio
body:

- type: markdown
attributes:
value: |
When reporting bugs, please follow the guidelines in this template. This helps identify the problem precisely and thus enables contributors to fix it faster.
- Write a descriptive issue title above.
- The golden rule is to **always open *one* issue for *one* bug**. If you notice several bugs and want to report them, make sure to create one new issue for each of them.
- Search [open](https://github.com/pyannote/pyannote-audio/issues) and [closed](https://github.com/pyannote/pyannote-audio/issues?q=is%3Aissue+is%3Aclosed) issues to ensure it has not already been reported. If you don't find a relevant match or if you're unsure, don't hesitate to **open a new issue**. The bugsquad will handle it from there if it's a duplicate.
- Please always check if your issue is reproducible in the latest version – it may already have been fixed!
- If you use a custom build, please test if your issue is reproducible in official releases too.

- type: textarea
attributes:
label: Tested versions
description: |
To properly fix a bug, we need to identify if the bug was recently introduced in the library, or if it was always present.
- Please specify the pyannote.audio version you found the issue in, including the **Git commit hash** if using a development build.
- If you can, **please test earlier pyannote.audio versions** and, if applicable, newer versions (development branch). Mention whether the bug is reproducible or not in the versions you tested.
- The aim is for us to identify whether a bug is a **regression**, i.e. an issue that didn't exist in a previous version, but was introduced later on, breaking existing functionality. For example, if a bug is reproducible in 3.2 but not in 3.0, we would like you to test intermediate 3.1 to find which version is the first one where the issue can be reproduced.
placeholder: |
- Reproducible in: 3.1, 3.2, and later
- Not reproducible in: 3.0
validations:
required: true

- type: input
attributes:
label: System information
description: |
- Specify the OS version and, when relevant, hardware information.
- For issues that are likely OS-specific and/or GPU-related, please specify the GPU model and architecture.
- **Bug reports not including the required information may be closed at the maintainers' discretion.** If in doubt, always include all the requested information; it's better to include too much information than not enough information.
placeholder: macOS 13.6 - pyannote.audio 3.1.1 - M1 Pro
validations:
required: true

- type: textarea
attributes:
label: Issue description
description: |
Describe your issue briefly. What doesn't work, and how do you expect it to work instead?
You can include audio, images or videos with drag and drop, and format code blocks or logs with <code>```</code> tags.
validations:
required: true

- type: input
attributes:
label: Minimal reproduction example (MRE)
description: |
A reproducible issue is a prerequisite for contributors to be able to solve it.
Include a link to a minimal reproduction example using [this Google Colab notebook](https://colab.research.google.com/github/pyannote/pyannote-audio/blob/develop/tutorials/MRE_template.ipynb) as a starting point.
validations:
required: true
15 changes: 15 additions & 0 deletions .github/ISSUE_TEMPLATE/config.yml
@@ -0,0 +1,15 @@
blank_issues_enabled: false

contact_links:

- name: Feature request
url: https://github.com/pyannote/pyannote-audio/discussions
about: Suggest an idea for this project.

- name: Consulting
url: https://herve.niderb.fr/consulting
about: Using pyannote.audio in production? Make the most of it thanks to our consulting services.

- name: Premium models
url: https://forms.gle/eKhn7H2zTa68sMMx8
about: We are considering selling premium models, extensions, or services around pyannote.audio.
20 changes: 0 additions & 20 deletions .github/ISSUE_TEMPLATE/feature_request.md

This file was deleted.

29 changes: 0 additions & 29 deletions .github/workflows/new_issue.yml

This file was deleted.

2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -14,7 +14,7 @@ repos:

# Sort imports
- repo: https://github.com/PyCQA/isort
rev: 5.10.1
rev: 5.12.0
hooks:
- id: isort
args: ["--profile", "black"]
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,16 @@
# Changelog

## develop

### New features

- feat(pipeline): add `Waveform` and `SampleRate` preprocessors
- feat(model): add `num_frames` and `receptive_field` to segmentation models

### Fixes

- fix(task): fix random generators

## Version 3.1.1 (2023-12-01)

### TL;DR
28 changes: 15 additions & 13 deletions README.md
@@ -70,26 +70,28 @@ for turn, _, speaker in diarization.itertracks(yield_label=True):
- Videos
- [Introduction to speaker diarization](https://umotion.univ-lemans.fr/video/9513-speech-segmentation-and-speaker-diarization/) / JSALT 2023 summer school / 90 min
- [Speaker segmentation model](https://www.youtube.com/watch?v=wDH2rvkjymY) / Interspeech 2021 / 3 min
- [First releaase of pyannote.audio](https://www.youtube.com/watch?v=37R_R82lfwA) / ICASSP 2020 / 8 min
- [First release of pyannote.audio](https://www.youtube.com/watch?v=37R_R82lfwA) / ICASSP 2020 / 8 min

## Benchmark

Out of the box, `pyannote.audio` speaker diarization [pipeline](https://hf.co/pyannote/speaker-diarization-3.1) v3.1 is expected to be much better (and faster) than v2.x.
Those numbers are diarization error rates (in %):

| Benchmark | [v2.1](https://hf.co/pyannote/speaker-diarization-2.1) | [v3.1](https://hf.co/pyannote/speaker-diarization-3.1) | [Premium](https://forms.gle/eKhn7H2zTa68sMMx8) |
| ---------------------- | ------ | ------ | --------- |
| [AISHELL-4](https://arxiv.org/abs/2104.03603) | 14.1 | 12.2 | 11.9 |
| [AliMeeting](https://www.openslr.org/119/) (channel 1) | 27.4 | 24.4 | 22.5 |
| [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (IHM) | 18.9 | 18.8 | 16.6 |
| [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (SDM) | 27.1 | 22.4 | 20.9 |
| [AVA-AVD](https://arxiv.org/abs/2111.14448) | 66.3 | 50.0 | 39.8 |
| [CALLHOME](https://catalog.ldc.upenn.edu/LDC2001S97) ([part 2](https://github.com/BUTSpeechFIT/CALLHOME_sublists/issues/1)) | 31.6 | 28.4 | 22.2 |
| [DIHARD 3](https://catalog.ldc.upenn.edu/LDC2022S14) ([full](https://arxiv.org/abs/2012.01477)) | 26.9 | 21.7 | 17.2 |
| [Earnings21](https://github.com/revdotcom/speech-datasets) | 17.0 | 9.4 | 9.0 |
| [Ego4D](https://arxiv.org/abs/2110.07058) (dev.) | 61.5 | 51.2 | 43.8 |
| [MSDWild](https://github.com/X-LANCE/MSDWILD) | 32.8 | 25.3 | 19.8 |
| [RAMC](https://www.openslr.org/123/) | 22.5 | 22.2 | 18.4 |
| [REPERE](https://www.islrn.org/resources/360-758-359-485-0/) (phase2) | 8.2 | 7.8 | 7.6 |
| [VoxConverse](https://github.com/joonson/voxconverse) (v0.3) | 11.2 | 11.3 | 9.4 |

[Diarization error rate](http://pyannote.github.io/pyannote-metrics/reference.html#diarization) (in %)
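For reference, the diarization error rate reported above is the sum of false alarm, missed detection, and speaker confusion durations, divided by the total duration of reference speech. A minimal sketch of that arithmetic — the durations below are made up for illustration, not taken from any benchmark:

```python
# Hypothetical error components for a 20-second reference, in seconds
total_reference = 20.0  # total duration of reference speech
false_alarm = 0.5       # hypothesis speech where the reference has none
missed_detection = 1.0  # reference speech absent from the hypothesis
confusion = 0.5         # speech attributed to the wrong speaker

der = (false_alarm + missed_detection + confusion) / total_reference
print(f"DER = {100 * der:.1f}%")  # reported as a percentage, as in the table above
```

Note that DER can exceed 100% when false alarms are large, since the numerator is not bounded by the reference duration.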

89 changes: 87 additions & 2 deletions pyannote/audio/models/blocks/sincnet.py
@@ -28,17 +28,21 @@
import torch.nn as nn
import torch.nn.functional as F
from asteroid_filterbanks import Encoder, ParamSincFB
from pyannote.core import SlidingWindow

from pyannote.audio.utils.frame import conv1d_num_frames, conv1d_receptive_field_size


class SincNet(nn.Module):
def __init__(self, sample_rate: int = 16000, stride: int = 1):
super().__init__()

if sample_rate != 16000:
raise NotImplementedError("PyanNet only supports 16kHz audio for now.")
raise NotImplementedError("SincNet only supports 16kHz audio for now.")
# TODO: add support for other sample rates. It should be enough to multiply
# kernel_size by (sample_rate / 16000), but this needs to be double-checked.

self.sample_rate = sample_rate
self.stride = stride

self.wav_norm1d = nn.InstanceNorm1d(1, affine=True)
@@ -70,6 +74,88 @@ def __init__(self, sample_rate: int = 16000, stride: int = 1):
self.pool1d.append(nn.MaxPool1d(3, stride=3, padding=0, dilation=1))
self.norm1d.append(nn.InstanceNorm1d(60, affine=True))

def num_frames(self, num_samples: int) -> int:
"""Compute number of output frames for a given number of input samples

Parameters
----------
num_samples : int
Number of input samples

Returns
-------
num_frames : int
Number of output frames
"""

kernel_size = [251, 3, 5, 3, 5, 3]
stride = [self.stride, 3, 1, 3, 1, 3]
padding = [0, 0, 0, 0, 0, 0]
dilation = [1, 1, 1, 1, 1, 1]

num_frames = num_samples
for k, s, p, d in zip(kernel_size, stride, padding, dilation):
num_frames = conv1d_num_frames(
num_frames, kernel_size=k, stride=s, padding=p, dilation=d
)

return num_frames

def receptive_field_size(self, num_frames: int = 1) -> int:
"""Compute receptive field size

Parameters
----------
num_frames : int, optional
Number of frames in the output signal

Returns
-------
receptive_field_size : int
Receptive field size
"""

kernel_size = [251, 3, 5, 3, 5, 3]
stride = [self.stride, 3, 1, 3, 1, 3]
padding = [0, 0, 0, 0, 0, 0]
dilation = [1, 1, 1, 1, 1, 1]

receptive_field_size = num_frames
for k, s, p, d in reversed(list(zip(kernel_size, stride, padding, dilation))):
receptive_field_size = conv1d_receptive_field_size(
num_frames=receptive_field_size,
kernel_size=k,
stride=s,
padding=p,
dilation=d,
)

return receptive_field_size

def receptive_field(self) -> SlidingWindow:
"""Compute receptive field

Returns
-------
receptive field : SlidingWindow

Source
------
https://distill.pub/2019/computing-receptive-fields/

"""

# duration of the receptive field of each output frame
duration = self.receptive_field_size() / self.sample_rate

# step between the receptive field region of two consecutive output frames
step = (
self.receptive_field_size(num_frames=2)
- self.receptive_field_size(num_frames=1)
) / self.sample_rate

return SlidingWindow(start=0.0, duration=duration, step=step)

def forward(self, waveforms: torch.Tensor) -> torch.Tensor:
"""Pass forward

@@ -83,7 +169,6 @@ def forward(self, waveforms: torch.Tensor) -> torch.Tensor:
for c, (conv1d, pool1d, norm1d) in enumerate(
zip(self.conv1d, self.pool1d, self.norm1d)
):

outputs = conv1d(outputs)

# https://github.com/mravanelli/SincNet/issues/4
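The `num_frames` and `receptive_field_size` methods added to `SincNet` chain standard 1-d convolution arithmetic through its six conv/pool layers. A self-contained sketch of that arithmetic, using the layer hyper-parameters listed in the diff and the default `stride=1` (`conv1d_num_frames` and `conv1d_receptive_field_size` are reimplemented here from the standard formulas, not imported from pyannote.audio):

```python
from math import floor

# Layer hyper-parameters as listed in SincNet.num_frames, with the
# default constructor stride=1 for the first (sinc) convolution.
KERNEL_SIZE = [251, 3, 5, 3, 5, 3]
STRIDE = [1, 3, 1, 3, 1, 3]
PADDING = [0] * 6
DILATION = [1] * 6

def conv1d_num_frames(n, k, s, p, d):
    # Standard Conv1d output-length formula
    return floor((n + 2 * p - d * (k - 1) - 1) / s) + 1

def conv1d_receptive_field_size(n_out, k, s, p, d):
    # Inverse mapping: input samples covered by n_out output frames
    return (n_out - 1) * s - 2 * p + d * (k - 1) + 1

def num_frames(num_samples):
    # Apply the output-length formula layer by layer, first to last
    n = num_samples
    for k, s, p, d in zip(KERNEL_SIZE, STRIDE, PADDING, DILATION):
        n = conv1d_num_frames(n, k, s, p, d)
    return n

def receptive_field_size(n_frames=1):
    # Walk the layers in reverse to map output frames back to input samples
    r = n_frames
    for k, s, p, d in reversed(list(zip(KERNEL_SIZE, STRIDE, PADDING, DILATION))):
        r = conv1d_receptive_field_size(r, k, s, p, d)
    return r

print(num_frames(16000))        # output frames for 1 s of 16 kHz audio
print(receptive_field_size(1))  # input samples seen by a single output frame
```

With a different `stride` for the first layer, only the first entry of `STRIDE` changes; the rest of the chain is identical.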
31 changes: 31 additions & 0 deletions pyannote/audio/models/segmentation/PyanNet.py
@@ -27,6 +27,7 @@
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange
from pyannote.core import SlidingWindow
from pyannote.core.utils.generators import pairwise

from pyannote.audio.core.model import Model
@@ -157,6 +158,36 @@ def build(self):
self.classifier = nn.Linear(in_features, out_features)
self.activation = self.default_activation()

def num_frames(self, num_samples: int) -> int:
"""Compute number of output frames for a given number of input samples

Parameters
----------
num_samples : int
Number of input samples

Returns
-------
num_frames : int
Number of output frames
"""

return self.sincnet.num_frames(num_samples)

def receptive_field(self) -> SlidingWindow:
"""Compute receptive field

Returns
-------
receptive field : SlidingWindow

Source
------
https://distill.pub/2019/computing-receptive-fields/

"""
return self.sincnet.receptive_field()

def forward(self, waveforms: torch.Tensor) -> torch.Tensor:
"""Pass forward

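The `SlidingWindow` that `receptive_field()` returns converts receptive-field sizes from samples into seconds: its `duration` comes from the size for one output frame, and its `step` from the growth between one and two frames. A minimal sketch of that conversion — the sample counts below are illustrative values for a `stride=1` configuration, not guaranteed to match every model:

```python
sample_rate = 16000

# Illustrative receptive-field sizes, in samples (hypothetical stride=1 values)
rf_one_frame = 325   # input samples covered by one output frame
rf_two_frames = 352  # input samples covered by two consecutive output frames

duration = rf_one_frame / sample_rate                # window duration, seconds
step = (rf_two_frames - rf_one_frame) / sample_rate  # hop between windows, seconds

# Time span of the receptive field of the i-th output frame,
# mirroring SlidingWindow(start=0.0, duration=duration, step=step)
def frame_span(i):
    start = i * step
    return start, start + duration

print(f"duration={duration * 1000:.2f} ms, step={step * 1000:.3f} ms")
```

This is why `PyanNet` can simply delegate both `num_frames` and `receptive_field` to its `SincNet` block: the LSTM and linear layers that follow preserve the frame rate, so the frame-to-time mapping is set entirely by the convolutional front-end.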