ZhongshuHou/LSA

Local Spectral Attention for Full-band Speech Enhancement

(Figure: visualization of local spectral attention)


Repository description

This repository conducts ablation studies on local attention (a.k.a. band attention) applied to the full-band spectrum, namely local spectral attention (LSA). Two full-band speech enhancement (SE) models with spectral attention replace the conventional attention (a global manner) with LSA, which only attends to bands adjacent to a given frequency (a local manner). One model is DPARN, whose source code can be found at https://github.com/Qinwen-Hu/dparn.
The other is the Multi-Scale Temporal Frequency with Axial Attention (MTFAA) network, which ranked 1st in the DNS-4 challenge for full-band SE; its detailed description can be found in the paper https://ieeexplore.ieee.org/document/9746610. Here we release an unofficial PyTorch implementation of MTFAA as well as its modification. This work has been submitted to Interspeech 2023.

Requirements

soundfile: 0.10.3
librosa: 0.8.1
torch: 3.7.10
numpy: 1.20.3
scipy: 1.7.2
pandas: 1.3.4
tqdm: 4.62.3

Network training

Data preparation

Split your speech and noise recordings into 10-second segments and generate .csv files to manage your data. Prepare your RIR audio files (.wav format) in one folder. Edit the .csv paths in Dataloader.py:

```python
TRAIN_NOISE_CSV = './train_noise_data.csv'
VALID_CLEAN_CSV = './valid_clean_data.csv'
VALID_NOISE_CSV = './valid_noise_data.csv'
RIR_DIR = 'path to the folder of RIR .wav audios'
```

where the .csv files for clean speech are organized as

| file_dir | snr |
| --- | --- |
| ./clean_0001.wav | 4 |
| ./clean_0002.wav | -1 |
| ./clean_0003.wav | 0 |
| ... | ... |

and the .csv files for noise are organized as

| file_dir |
| --- |
| ./noise_0001.wav |
| ./noise_0002.wav |
| ./noise_0003.wav |
| ... |

Here 'file_dir' and 'snr' denote the absolute path to the audio file and the signal-to-noise ratio (SNR), respectively.
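As a concrete sketch, the two .csv layouts above can be generated with the standard csv module (the file paths below are placeholders, not files shipped with this repo):

```python
import csv

# Placeholder paths -- substitute the absolute paths to your 10-second segments.
clean_rows = [("./clean_0001.wav", 4), ("./clean_0002.wav", -1), ("./clean_0003.wav", 0)]
noise_paths = ["./noise_0001.wav", "./noise_0002.wav", "./noise_0003.wav"]

# Clean-speech CSV: one row per file with its mixing SNR in dB.
with open("train_clean_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file_dir", "snr"])
    writer.writerows(clean_rows)

# Noise CSV: file paths only.
with open("train_noise_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file_dir"])
    writer.writerows([p] for p in noise_paths)
```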

Start training

After preparing the environment and data, start training with:

```
python Network_Training_MTFAA_full.py -m <model> -c <checkpoint_dir> -e <epochs> -d <device>
```

- `-m`: model to train (MTFAA, MTFAA_LSA, or MTFAA_ASqBi)
- `-c`: directory to save the checkpoint files
- `-e`: number of training epochs (default: 300)
- `-d`: device used for training (e.g. cuda:0)

Inference

Enhance noisy audio with:

```
python Infer.py -m <model> -c <checkpoint_path> -t <noisy_dir> -s <output_dir> -d <device>
```

- `-m`: model to use (MTFAA, MTFAA_LSA, or MTFAA_ASqBi)
- `-c`: path to load the checkpoint files
- `-t`: folder containing the noisy audio
- `-s`: folder to save the enhanced clips
- `-d`: device used for inference (e.g. cuda:0)

Ablation study and experiment results

We demonstrate the effectiveness of the proposed method on the full-band dataset of the 4th DNS challenge. The training set contains around 1000 hours of speech and 220 hours of noise. Room impulse responses are convolved with clean speech to generate simulated reverberant speech, which is kept as the training target. During training, reverberant utterances are mixed with noise recordings at SNRs ranging from -5 dB to 5 dB in 1 dB steps. For the test set, 800 clips of reverberant utterances are mixed with unseen noise types at SNRs ranging from -5 dB to 15 dB. Each test clip is 5 seconds long. All utterances are sampled at 48 kHz. We also conduct experiments on the well-known VCTK-DEMAND dataset for comprehensive validation.
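The SNR-controlled mixing described above can be sketched in a few lines of numpy (a minimal illustration; the repository's Dataloader.py handles this together with RIR convolution):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise power ratio equals
    snr_db, then return the mixture (all signals are 1-D float arrays)."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# Example: mix a tone with white noise at 0 dB SNR.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(48000) / 48000)
noise = rng.standard_normal(48000)
mixture = mix_at_snr(speech, noise, snr_db=0)
```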

LSA on MTFAA and DPARN

The visualization of LSA mechanism can be seen in the figure below:

The unofficial PyTorch implementations of MTFAA and its LSA-based variant can be found in MTFAA_Net_full.py and MTFAA_Net_full_local_atten.py, respectively. For DPARN, readers may refer to https://github.com/Qinwen-Hu/dparn.
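For reference, the band-limiting idea behind LSA can be sketched in a few lines of numpy. This is a simplified, non-official illustration assuming a symmetric window of half-width n_l around each query band; the actual attention blocks live in the files above:

```python
import numpy as np

def local_spectral_attention(scores, n_l):
    """Softmax over frequency-axis attention scores after masking out all
    key bands farther than n_l bins from the query band.
    scores: (F, F) raw logits, rows = query bands, cols = key bands."""
    F = scores.shape[0]
    idx = np.arange(F)
    far = np.abs(idx[:, None] - idx[None, :]) > n_l   # True -> masked out
    masked = np.where(far, -np.inf, scores)
    # Numerically stable softmax; masked entries get exactly zero weight.
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

weights = local_spectral_attention(np.zeros((8, 8)), n_l=2)
```

Each row of `weights` sums to 1, and only bands within distance n_l of the query receive nonzero weight.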
First, we experiment with different settings of Nl on the VCTK-DEMAND dataset; the results are shown in the table below:

| Model | Nl | PESQ | CSIG | CBAK | COVL | STOI (%) | SiSDR (dB) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MTFAA | F'/2 | 3.16 | 4.34 | 3.63 | 3.77 | 94.7 | 18.5 |
| MTFAA | F'/4 | 3.15 | 4.32 | 3.58 | 3.76 | 94.6 | 18.1 |
| MTFAA | sqrt(F') | 3.16 | 4.35 | 3.61 | 3.78 | 94.7 | 18.8 |
| DPARN | F'/2 | 2.96 | 4.29 | 3.63 | 3.68 | 94.2 | 18.7 |
| DPARN | F'/4 | 2.95 | 4.27 | 3.65 | 3.68 | 94.2 | 18.8 |
| DPARN | sqrt(F') | 2.94 | 4.27 | 3.62 | 3.67 | 94.1 | 18.5 |

PESQ, CSIG, CBAK, and COVL are wideband metrics; STOI and SiSDR are full-band metrics.

The setting of Nl affects the two models differently, so we choose the best-performing setting for each: sqrt(F') for MTFAA and F'/2 for DPARN. Next, we train the models on the larger DNS-4 dataset. The training curves are shown in the figures below, where both LSA-based models converge better than their original counterparts.

The objective test results are shown in the tables below.

Full-band metrics: STOI and SiSDR (dB) by input SNR (dB):

| Model | STOI -5~0 | STOI 0~15 | STOI Ovrl. | SiSDR -5~0 | SiSDR 0~15 | SiSDR Ovrl. |
| --- | --- | --- | --- | --- | --- | --- |
| Noisy | 0.687 | 0.805 | 0.771 | -2.515 | 7.971 | 5.166 |
| MTFAA | 0.805 | 0.876 | 0.856 | 10.10 | 15.74 | 14.23 |
| MTFAA-LSA | 0.809 | 0.881 | 0.860 | 10.34 | 16.20 | 14.63 |
| DPARN | 0.752 | 0.858 | 0.828 | 8.461 | 13.71 | 12.31 |
| DPARN-LSA | 0.757 | 0.861 | 0.831 | 8.617 | 13.84 | 12.47 |

LSD (dB) by frequency band (kHz):

| Model | LSD 0~8 | LSD 8~24 | LSD Full |
| --- | --- | --- | --- |
| Noisy | 18.37 | 12.38 | 14.38 |
| MTFAA | 10.33 | 9.349 | 9.678 |
| MTFAA-LSA | 9.840 | 8.636 | 9.037 |
| DPARN | 10.92 | 13.11 | 12.38 |
| DPARN-LSA | 10.76 | 12.99 | 12.25 |

Wideband metrics by input SNR (dB):

| Model | PESQ -5~0 | PESQ 0~15 | PESQ Ovrl. | CSIG -5~0 | CSIG 0~15 | CSIG Ovrl. | CBAK -5~0 | CBAK 0~15 | CBAK Ovrl. | COVL -5~0 | COVL 0~15 | COVL Ovrl. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Noisy | 1.160 | 1.446 | 1.364 | 2.023 | 2.719 | 2.517 | 1.833 | 2.481 | 2.293 | 1.571 | 2.095 | 1.943 |
| MTFAA | 1.981 | 2.669 | 2.470 | 3.465 | 4.113 | 3.925 | 2.951 | 3.523 | 3.357 | 2.754 | 3.436 | 3.238 |
| MTFAA-LSA | 2.084 | 2.795 | 2.589 | 3.517 | 4.203 | 4.004 | 3.006 | 3.593 | 3.423 | 2.829 | 3.547 | 3.339 |
| DPARN | 1.702 | 2.309 | 2.134 | 3.136 | 3.759 | 3.580 | 2.505 | 2.859 | 2.757 | 2.447 | 3.069 | 2.890 |
| DPARN-LSA | 1.776 | 2.423 | 2.237 | 3.179 | 3.829 | 3.642 | 2.619 | 3.030 | 2.912 | 2.507 | 3.166 | 2.977 |

The proposed LSA improves the enhancement performance of both the causal DPARN and MTFAA models in terms of all objective metrics. To reveal the benefit of the LSA mechanism, we visualize the normalized average spectral attention maps of the attention blocks in the original and LSA-based MTFAA, computed from audios in the test set, as shown in the figures below.

The fifth attention layer shows that the LSA-based model more effectively emphasizes the structural features of harmonics in the low bands (marked with red boxes) and the almost randomly distributed components in the high bands (marked with black boxes). Furthermore, the blue boxes show that LSA also suppresses the modeling of invalid correlations between the low and high bands. Hence, the spectral speech pattern is better modeled with LSA. Further inspection of the enhanced signals reveals that global attention over frequency is more likely to distort speech components or leave excessive residual noise in non-speech segments, a problem effectively alleviated by the proposed LSA. Two typical examples are shown in Figure 3, where the benefit of LSA is clearly visible. A possible explanation is that better exploitation of the speech pattern helps the LSA-based model discriminate speech from noise components, especially in low-SNR environments.

To further demonstrate the importance of modeling local spectral correlation for full-band SE, we also compare local attention with a recently proposed biased-attention method, Attention with Linear Biases (ALiBi), which adds a negative bias to attention scores that grows linearly with the distance between key and query, enabling efficient extrapolation. Its application to spectral attention is shown in the figure below.

We modify the penalty to grow quadratically with distance for better performance and name this variant ASqBi, as indicated in the figures below.
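The difference between the two biases can be illustrated as follows. The slope value is an arbitrary placeholder; the exact bias used here is defined in MTFAA_Net_full_F_ASqbi.py:

```python
import numpy as np

def alibi_bias(n_bands, slope=0.5):
    """ALiBi: negative bias growing linearly with key-query distance."""
    d = np.abs(np.arange(n_bands)[:, None] - np.arange(n_bands)[None, :])
    return -slope * d.astype(float)

def asqbi_bias(n_bands, slope=0.5):
    """ASqBi: the same idea, but the penalty grows with the *squared*
    distance, leaving nearby bands almost unpenalized while suppressing
    distant bands more sharply."""
    d = np.abs(np.arange(n_bands)[:, None] - np.arange(n_bands)[None, :])
    return -slope * d.astype(float) ** 2

lin, sq = alibi_bias(6), asqbi_bias(6)
```

Adding either matrix to the raw attention logits before the softmax biases each query band toward its spectral neighborhood.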

The modified method is combined with MTFAA in MTFAA_Net_full_F_ASqbi.py, and the ablation results are given in the MTFAA-ASqBi rows of the tables below. The overall performance degrades compared with LSA, possibly because the negative bias applied within the local attention region weakens the model's ability to extract local correlations.

Full-band metrics: STOI and SiSDR (dB) by input SNR (dB):

| Model | STOI -5~0 | STOI 0~15 | STOI Ovrl. | SiSDR -5~0 | SiSDR 0~15 | SiSDR Ovrl. |
| --- | --- | --- | --- | --- | --- | --- |
| MTFAA-ASqBi | 0.811 | 0.881 | 0.860 | 10.425 | 15.944 | 14.468 |
| MTFAA-LSA | 0.809 | 0.881 | 0.860 | 10.347 | 16.201 | 14.635 |

LSD (dB) by frequency band (kHz):

| Model | LSD 0~8 | LSD 8~24 | LSD Full |
| --- | --- | --- | --- |
| MTFAA-ASqBi | 10.307 | 9.495 | 9.766 |
| MTFAA-LSA | 9.840 | 8.636 | 9.037 |

Wideband metrics by input SNR (dB):

| Model | PESQ -5~0 | PESQ 0~15 | PESQ Ovrl. | CSIG -5~0 | CSIG 0~15 | CSIG Ovrl. | CBAK -5~0 | CBAK 0~15 | CBAK Ovrl. | COVL -5~0 | COVL 0~15 | COVL Ovrl. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MTFAA-ASqBi | 2.064 | 2.769 | 2.564 | 3.487 | 4.165 | 3.968 | 2.987 | 3.558 | 3.392 | 2.804 | 3.513 | 3.308 |
| MTFAA-LSA | 2.084 | 2.795 | 2.589 | 3.517 | 4.203 | 4.004 | 3.006 | 3.593 | 3.423 | 2.829 | 3.547 | 3.339 |

We also conduct a subjective listening preference test on the MTFAA model to validate the benefit of the LSA mechanism. 50 enhanced samples, together with their reference target speech, are randomly selected from the test set, and 15 normal-hearing listeners compare the enhanced results against the reference and choose the preferred one. Each sample is evaluated by at least 3 listeners. The results are shown in the table below: over 60% of the samples enhanced by the LSA-based MTFAA are judged to have better perceptual quality and lower noise levels, demonstrating the effectiveness of LSA for full-band SE tasks.

| Model | Preference w/o LSA (%) | Preference w/ LSA (%) |
| --- | --- | --- |
| MTFAA | 38.0 | 62.0 |

The proposed method also reduces the computational complexity of spectral attention; the statistics are given in the table below.

| Model | Complexity reduction in spectral attention (%) |
| --- | --- |
| MTFAA | 63.2 |
| DPARN | 25.4 |
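The saving comes directly from shrinking each frame's frequency-axis score matrix from F'×F' to F'×Nl entries. A back-of-the-envelope helper (the band counts below are hypothetical; the actual percentages depend on each model's frequency resolution and Nl setting):

```python
def spectral_attention_reduction(n_bands, n_l):
    """Fraction of score computations saved when each of the n_bands
    query bands attends to n_l key bands instead of all n_bands keys."""
    return 1.0 - (n_bands * n_l) / (n_bands * n_bands)

# Hypothetical example: 256 ERB bands, local window covering 64 keys.
saving = spectral_attention_reduction(256, 64)  # 0.75 -> 75% fewer ops
```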

 

We also compare the modified MTFAA model with previous full-band SOTA methods on the VCTK-DEMAND dataset; the results are listed in the table below.

| Models | Year | Param. (M) | PESQ | STOI (%) | CSIG | CBAK | COVL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Noisy | - | - | 1.97 | 92.1 | 3.34 | 2.44 | 2.63 |
| RNNoise | 2020 | 0.06 | 2.33 | 92.2 | 3.40 | 2.51 | 2.84 |
| PercepNet | 2020 | 8.00 | 2.73 | - | - | - | - |
| CTS-Net (full) | 2020 | 7.09 | 2.92 | 94.3 | 4.22 | 3.43 | 3.62 |
| DCCRN | 2020 | 3.70 | 2.54 | 93.8 | 3.74 | 3.13 | 2.75 |
| NSNet2 | 2021 | 6.17 | 2.47 | 90.3 | 3.23 | 2.99 | 2.90 |
| S-DCCRN | 2022 | 2.34 | 2.84 | 94.0 | 4.03 | 3.43 | 2.97 |
| FullSubNet+ | 2022 | 8.67 | 2.88 | 94.0 | 3.86 | 3.42 | 3.57 |
| GaGNet | 2022 | 5.95 | 2.94 | - | 4.26 | 3.45 | 3.59 |
| DMF-Net | 2022 | 7.84 | 2.97 | 94.4 | 4.26 | 3.52 | 3.62 |
| DS-Net | 2022 | 3.30 | 2.78 | 94.3 | 4.20 | 3.34 | 3.48 |
| SF-Net | 2022 | 6.98 | 3.02 | 94.5 | 4.36 | 3.54 | 3.67 |
| DeepFilterNet2 | 2022 | 2.31 | 3.08 | 94.3 | 4.30 | 3.40 | 3.70 |
| MTFAA (Cau., LSA) | 2023 | 1.5 | 3.16 | 94.7 | 4.35 | 3.61 | 3.78 |
| MTFAA (Non-cau., LSA) | 2023 | 1.5 | 3.30 | 95.3 | 4.45 | 3.73 | 3.90 |

To further investigate the patterns of spectral attention, we first plot the attention maps generated from clean male and female speech.

(Figure: attention maps for clean speech)

The patterns reflect speech characteristics: harmonics are concentrated in the low bands (top-left corner of each attention plot), while consonants, which are almost randomly distributed, occupy the high bands (bottom-right corner). Female pitches are also visibly higher than male pitches, with larger harmonic spacing. Note that the harmonic-related lines are not parallel to each other because the frequencies are on the ERB scale.

We then plot the attention maps generated from noisy speech; the decoder attention plots are given below, with the harmonic-related features highlighted.

(Figure: decoder attention maps for noisy speech)

Inspired by ALiBi, we also experiment with multi-scale local spectral attention (MSLSA), as shown in the figure below.
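The multi-scale idea can be sketched as one local window width per attention head (the widths below are illustrative only; None marks a fully global head):

```python
import numpy as np

def multi_scale_masks(n_bands, widths):
    """Boolean keep-masks for the heads of multi-scale local spectral
    attention: head h may attend only to key bands within widths[h]
    bins of the query band (None -> global head, no restriction)."""
    idx = np.arange(n_bands)
    dist = np.abs(idx[:, None] - idx[None, :])
    return [np.ones((n_bands, n_bands), dtype=bool) if w is None else dist <= w
            for w in widths]

n_bands = 16
# Illustrative 4-head setup spanning global (Nl = F') down to Nl = F'/8.
masks = multi_scale_masks(n_bands, [None, n_bands // 2, n_bands // 4, n_bands // 8])
```

Each head applies its own mask to the attention scores, so the model sees several spectral context widths at once.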

The performance of MSLSA is reported in the tables below.

Full-band metrics: STOI and SiSDR (dB) by input SNR (dB):

| Model | STOI -5~0 | STOI 0~15 | STOI Ovrl. | SiSDR -5~0 | SiSDR 0~15 | SiSDR Ovrl. |
| --- | --- | --- | --- | --- | --- | --- |
| Noisy | 0.687 | 0.805 | 0.771 | -2.515 | 7.971 | 5.166 |
| MTFAA | 0.805 | 0.876 | 0.856 | 10.10 | 15.74 | 14.23 |
| MTFAA-LSA | 0.809 | 0.881 | 0.860 | 10.34 | 16.20 | 14.63 |
| MTFAA-MSLSA | 0.809 | 0.880 | 0.859 | 10.43 | 15.98 | 14.50 |

LSD (dB) by frequency band (kHz):

| Model | LSD 0~8 | LSD 8~24 | LSD Full |
| --- | --- | --- | --- |
| Noisy | 18.37 | 12.38 | 14.38 |
| MTFAA | 10.33 | 9.349 | 9.678 |
| MTFAA-LSA | 9.840 | 8.636 | 9.037 |
| MTFAA-MSLSA | 10.03 | 8.623 | 9.094 |

Wideband metrics by input SNR (dB):

| Model | PESQ -5~0 | PESQ 0~15 | PESQ Ovrl. | CSIG -5~0 | CSIG 0~15 | CSIG Ovrl. | CBAK -5~0 | CBAK 0~15 | CBAK Ovrl. | COVL -5~0 | COVL 0~15 | COVL Ovrl. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Noisy | 1.160 | 1.446 | 1.364 | 2.023 | 2.719 | 2.517 | 1.833 | 2.481 | 2.293 | 1.571 | 2.095 | 1.943 |
| MTFAA | 1.981 | 2.669 | 2.470 | 3.465 | 4.113 | 3.925 | 2.951 | 3.523 | 3.357 | 2.754 | 3.436 | 3.238 |
| MTFAA-LSA | 2.084 | 2.795 | 2.589 | 3.517 | 4.203 | 4.004 | 3.006 | 3.593 | 3.423 | 2.829 | 3.547 | 3.339 |
| MTFAA-MSLSA | 2.077 | 2.772 | 2.571 | 3.500 | 4.167 | 3.974 | 3.013 | 3.589 | 3.422 | 2.820 | 3.517 | 3.314 |

Wideband metric SSNR (dB) by input SNR (dB):

| Model | SSNR -5~0 | SSNR 0~15 | SSNR Ovrl. |
| --- | --- | --- | --- |
| Noisy | -2.291 | 4.19 | 2.307 |
| MTFAA | 6.550 | 10.13 | 9.094 |
| MTFAA-LSA | 6.609 | 10.26 | 9.200 |
| MTFAA-MSLSA | 6.779 | 10.38 | 9.338 |

The best performance of MSLSA is slightly worse than that of conventional LSA, but the training process suggests that MSLSA converges to a more stable result, with a higher mean and lower variance of validation PESQ, as shown in the figure below (statistics over the last 100 epochs).

Average attention plots of the different MSLSA heads are also given below; the global attention head (Nl = F') fails to exploit clear spectral features, while the local head (Nl = F'/4) succeeds.

(Figure: average attention plots of different heads)
