Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Matching MFCC configuration with Librosa's #54

Open
Yaxit opened this issue Oct 5, 2022 · 3 comments
Open

Matching MFCC configuration with Librosa's #54

Yaxit opened this issue Oct 5, 2022 · 3 comments
Labels
question Further information is requested

Comments

@Yaxit
Copy link

Yaxit commented Oct 5, 2022

Hello,

I am trying to replicate the MFCC output of Librosa, which is widely used as the reference library for audio manipulation.

   mfccs = librosa.feature.mfcc(y=data,
        sr=16000,
        n_mfcc=32,
        n_fft = 2048,
        hop_length = 1024,
        window='hamming',
        center=False,            # for padding, not used
        pad_mode='constant',    # for padding, not used
        power=2.0,
        n_mels=32,
        fmin=64,
        fmax=8000,
        dct_type=2,
        norm='ortho'
        )

On my ARM microcontroller, I am using the arm_mfcc_f32 callback and the arm_mfcc_init_f32 to initialize the parameters.
The parameters have been generated with the scripts from this repository.

Generally, I would expect some small deviation from the results on the two platforms, since CMSIS uses approximated calculations for logarithms, etc; however, I found that the outputs are completely different (one order of magnitude, sign).

I also could not find a clear explanation of which parameters are used in the CMSIS implementation for power, dct_type, and norm.
Am I missing some steps? Or is this expected altogether?

I believe this would be quite important to clarify, to allow people to design a data pipeline that can be quite reproducible in the microcontroller :)

@christophe0606 christophe0606 added the review Under review label Oct 6, 2022
@christophe0606
Copy link
Contributor

@Yaxit It was tested with the MFCC from TensorFlow using Hamming window

If Librosa is a reference then it may be worth looking at it for a future improvement.

The dct_type and norm look coherent with the TensorFlow version : https://www.tensorflow.org/api_docs/python/tf/signal/mfccs_from_log_mel_spectrograms

From TensorFlow documentation:

Mel-Frequency Cepstral Coefficient (MFCC) calculation consists of taking the DCT-II of a log-magnitude mel-scale spectrogram. HTK's MFCCs use a particular scaling of the DCT-II which is almost orthogonal normalization. We follow this convention.

I think power should be 1.

@christophe0606
Copy link
Contributor

@Yaxit I have made some tests. Here is a Python script to compute the 3 versions : TensorFlow, CMSIS-DSP, Librosa. I am using the CMSIS-DSP Python wrapper for testing (pip install cmsisdsp).

The results I am getting:

TF
tf.Tensor(
[-60.15401347  -0.33196457  -6.04726881  -7.64207099  -3.54823378
   1.31988137   3.88698841   2.17443039  -1.33955457  -3.54223212
  -2.35875732   0.7373274    2.93307946], shape=(13,), dtype=float64)
CMSIS-DSP
[-60.154556    -0.33157736  -6.0473366   -7.6420746   -3.5485296
   1.3204467    3.8861852    2.1752558   -1.3400406   -3.5417192
  -2.3583646    0.7381086    2.9335299 ]
Librosa
[-352.07525027   24.55651523  -26.26419471  -30.30838539  -15.41778675
    6.76462123   16.88078724    9.97585971   -5.81466217  -15.06582109
  -10.2487501     3.41145905   12.73910052]

I don't understand all the settings of Librosa. I played with some of them and I can't get the same result as TensorFlow.

I was hoping that the Librosa setting htk=True may help but it did not.

So, it looks like Librosa is using different convention and formula.

For below script, I have started from the TensorFlow example in the documentation and I have reused the same parameters:

import tensorflow as tf 
import cmsisdsp as dsp
import librosa
import numpy as np
import scipy.signal as sig
import cmsisdsp.mfcc as mfcc
from cmsisdsp.datatype import F32
import math 

# https://librosa.org/doc/main/generated/librosa.feature.mfcc.html
# https://www.tensorflow.org/api_docs/python/tf/signal/mfccs_from_log_mel_spectrograms

FFTSize = 1024
numOfDctOutputs = 13
    
freq_min = 80.0
freq_high = 7600.0
numOfMelFilters = 80
frame_step = 256

num_samples, sample_rate = 32000, 16000.0

t = np.linspace(0,2,num_samples)
pcm = np.array(np.sin(2*math.pi * t * 1000),dtype=np.double)


# A 1024-point STFT with frames of 64 ms and 75% overlap.
stfts = tf.signal.stft(pcm, frame_length=FFTSize, frame_step=frame_step,
                       fft_length=FFTSize,
                       window_fn=tf.signal.hamming_window)
spectrograms = tf.abs(stfts)


# Warp the linear scale spectrograms into the mel-scale.
num_spectrogram_bins = stfts.shape[-1]
lower_edge_hertz, upper_edge_hertz, num_mel_bins = freq_min, freq_high, numOfMelFilters
linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
  num_mel_bins, num_spectrogram_bins, sample_rate, lower_edge_hertz,
  upper_edge_hertz,dtype=tf.double)

#print(spectrograms.dtype)
#print(linear_to_mel_weight_matrix.dtype)
mel_spectrograms = tf.tensordot(
  spectrograms, linear_to_mel_weight_matrix, 1)
mel_spectrograms.set_shape(spectrograms.shape[:-1].concatenate(
  linear_to_mel_weight_matrix.shape[-1:]))

# Compute a stabilized log to get log-magnitude mel-scale spectrograms.
log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-6)

# Compute MFCCs from log_mel_spectrograms and take the first 13.
mfccsT = tf.signal.mfccs_from_log_mel_spectrograms(
  log_mel_spectrograms)[..., :numOfDctOutputs]

mfccsL = librosa.feature.mfcc(y=np.array(pcm),
        sr=sample_rate,
        n_mfcc=numOfDctOutputs,
        n_fft = FFTSize,
        hop_length = frame_step,
        window=sig.hamming(FFTSize, sym=False),
        center=False,            # for padding, not used
        pad_mode='constant',    # for padding, not used
        power=1.0,
        n_mels=numOfMelFilters,
        fmin=freq_min,
        fmax=freq_high,
        dct_type=2,
        norm='ortho',
        htk=True
        )

print("TF")
print(mfccsT[0])

window = sig.hamming(FFTSize, sym=False)
      
filtLen,filtPos,packedFilters = mfcc.melFilterMatrix(F32,freq_min, freq_high, numOfMelFilters,sample_rate,FFTSize)

       
dctMatrixFilters = mfcc.dctMatrix(F32,numOfDctOutputs, numOfMelFilters)

mfccf32=dsp.arm_mfcc_instance_f32()

status=dsp.arm_mfcc_init_f32(mfccf32,FFTSize,numOfMelFilters,numOfDctOutputs,dctMatrixFilters,
    filtPos,filtLen,packedFilters,window)

tmp=np.zeros(FFTSize + 2)
res=dsp.arm_mfcc_f32(mfccf32,pcm[0:FFTSize],tmp)

print("CMSIS-DSP")
print(res)

print("Librosa")
print(np.array(mfccsL).T[0])

@Yaxit
Copy link
Author

Yaxit commented Oct 7, 2022

@christophe0606 Thank you for the clarification.
I also got the same results with librosa having different results.
I'm trying to discover exactly what is different, and why. It would be nice to understand where the difference lies.
I'll keep you posted.

@christophe0606 christophe0606 added question Further information is requested and removed review Under review labels Oct 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants