I'm trying to replicate the process of computing MFCCs from an audio file. So far I have obtained the Mel spectrogram, and the last step is to apply a Discrete Cosine Transform (DCT) to it. I tried applying scipy's dct() function to the spectrogram, but the result is not what I'm looking for. I cross-checked against Librosa's MFCC function too, and the output is still different. Please help, and thank you in advance!
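Here is the code I used to generate the Mel spectrogram:
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display
from scipy import fft  # assumed source of the fft.dct call used further down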
# Function to perform STFT on each window
def stft(signal, windowSize, windowStep):
    # Estimate the number of full frames that fit in the signal
    n_frames = 1 + int((len(signal) - windowSize) / windowStep)
    # Initialize an empty matrix to store the STFT magnitudes
    stft_matrix = np.zeros((n_frames, int(windowSize / 2) + 1), dtype=np.float32)
    # Loop over the frames, keeping only the non-negative frequency bins (0 Hz up to Nyquist)
    for i in range(n_frames):
        start = i * windowStep
        end = start + windowSize
        frame = signal[start:end] * np.hanning(windowSize)
        frame_fft = np.fft.fft(frame)[:int(windowSize / 2) + 1]
        stft_matrix[i, :] = np.abs(frame_fft)
    return stft_matrix
# Input signal
wav_name = '0015_000009_neutral.wav'
x, sr = librosa.load(wav_name, sr=None)  # sr=None keeps the file's native sampling rate
# Initialize window step and length
window_size = 0.025 # 25 ms
window_step = 0.010 # 10 ms
stft_matrix = stft(x, int(window_size * sr), int(window_step * sr))
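For reference, the framing loop above can also be written without an explicit loop. The following is a minimal sketch, assuming NumPy 1.20 or later (for `sliding_window_view`); the name `stft_magnitude` and the toy noise signal are hypothetical and only serve the illustration. It uses the same symmetric Hann window and no padding, so it should agree with the loop version. As far as I know, `librosa.stft` instead defaults to `n_fft=2048`, a hop of `n_fft // 4`, a periodic Hann window, and `center=True` padding, so frame-level values from Librosa's defaults are not expected to match a 25 ms / 10 ms unpadded STFT.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def stft_magnitude(signal, window_size, window_step):
    """Magnitude STFT with a symmetric Hann window and no padding.

    Returns an array of shape (n_frames, window_size // 2 + 1).
    """
    # Build every length-window_size slice of the signal, then keep one per hop
    frames = sliding_window_view(signal, window_size)[::window_step]
    windowed = frames * np.hanning(window_size)
    # rfft returns only the non-negative frequency bins (0 Hz up to Nyquist)
    return np.abs(np.fft.rfft(windowed, axis=1))


# Toy input (assumption): one second of noise at a hypothetical 16 kHz rate
demo_sr = 16000
demo_signal = np.random.default_rng(0).standard_normal(demo_sr)
demo_stft = stft_magnitude(demo_signal, int(0.025 * demo_sr), int(0.010 * demo_sr))
print(demo_stft.shape)
```

Using `np.fft.rfft` avoids computing the redundant negative-frequency half of the spectrum that the loop version slices away.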
# Plot vanilla spectrogram
# Transpose
stftTranspose = stft_matrix.transpose()
# Convert STFT to dB-scaled spectrogram
spectrogram = librosa.amplitude_to_db(stftTranspose, ref=np.max)
# Set up x-axis and y-axis parameters
time_axis = np.arange(spectrogram.shape[1])
freq_axis = np.arange(spectrogram.shape[0])
# Plot the spectrogram
librosa.display.specshow(spectrogram, x_axis='time', y_axis='linear', sr=sr, hop_length=int(window_step * sr))
# Add colorbar and labels
plt.colorbar(format='%+2.0f dB')
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
# Constructing Mel Filterbank
frameSize = int(window_size*sr)
hopLength = int(window_step*sr)
melFilters = librosa.filters.mel(n_fft=frameSize, sr=sr, n_mels=128)
melFilters.shape
# Librosa's default (norm='slaney') scales each triangular filter by its bandwidth;
# dividing each row by its peak turns the bank back into plain unit-peak triangular filters
melFilters /= np.max(melFilters, axis=-1)[:, None]
plt.plot(melFilters.T)
# Matrix multiplication between Mel Filterbank and Spectrogram
melSpec = np.dot(melFilters, stftTranspose**2 )
melSpec.shape
# Log
logMel = librosa.amplitude_to_db(S=melSpec, ref=np.max)
logMel.shape
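One thing worth checking at this step: `melSpec` is built from the squared magnitude, so it holds power values, while `librosa.amplitude_to_db` expects amplitudes. The sketch below is a simplified stand-in for the two conversions (it omits Librosa's `top_db` clipping, and the function names and the `amin` floors are assumptions for illustration) to show that feeding power values into the amplitude formula doubles every dB value. `librosa.power_to_db` is, as far as I can tell, the conversion that `librosa.feature.mfcc` applies to its Mel spectrogram.

```python
import numpy as np


def power_to_db_sketch(power, ref=1.0, amin=1e-10):
    """Simplified dB conversion for power values: 10 * log10(power / ref)."""
    return 10.0 * np.log10(np.maximum(power, amin) / ref)


def amplitude_to_db_sketch(amplitude, ref=1.0, amin=1e-5):
    """Simplified dB conversion for amplitude values: 20 * log10(amplitude / ref)."""
    return 20.0 * np.log10(np.maximum(amplitude, amin) / ref)


# Hypothetical power values spanning two decades
demo_power = np.array([1.0, 10.0, 100.0])
db_correct = power_to_db_sketch(demo_power)      # power treated as power
db_doubled = amplitude_to_db_sketch(demo_power)  # power mistaken for amplitude
print(db_correct, db_doubled)
```

The choice of `ref` matters less for the comparison: where no clipping occurs, a different `ref` only adds a constant offset in dB, and a constant offset across all Mel bands affects only the zeroth DCT coefficient.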
# Plotting the mel spectrogram
plt.figure(figsize=(25, 10))
librosa.display.specshow(logMel, sr=sr, hop_length=hopLength, x_axis='time', y_axis='mel', fmax=sr/2)
plt.colorbar(format='%+2.f dB')
plt.title('Mel spectrogram')
# Trying to apply DCT to the Mel Spectrogram
mfcc = fft.dct(logMel)
mfcc.shape
plt.figure(figsize=(25, 10))
librosa.display.specshow(mfcc, sr=sr, hop_length=hopLength, x_axis='time', y_axis='mel', fmax=sr/2)
plt.colorbar(format='%+2.f dB')
plt.title('MFCC')
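A likely source of the mismatch is the axis: `fft.dct` defaults to `axis=-1`, which for a `(n_mels, n_frames)` array transforms along time, whereas MFCCs take the DCT across the Mel bands of each frame. Below is a minimal sketch, assuming `scipy.fft` is available and that `logMel` has Mel bands along axis 0. The helper name `mfcc_from_log_mel` and the random demo data are hypothetical. The orthonormal type-2 DCT and the default of 20 retained coefficients mirror what I understand `librosa.feature.mfcc` to use, so treat those settings as assumptions to verify against your Librosa version.

```python
import numpy as np
from scipy import fft


def mfcc_from_log_mel(log_mel, n_mfcc=20):
    """Apply an orthonormal type-2 DCT along the Mel axis of a log-Mel spectrogram.

    log_mel has shape (n_mels, n_frames); the result has shape (n_mfcc, n_frames).
    """
    coeffs = fft.dct(log_mel, type=2, norm="ortho", axis=0)
    # Keep only the lowest-order cepstral coefficients
    return coeffs[:n_mfcc, :]


# Random stand-in for a (128 Mel bands x 50 frames) log-Mel spectrogram
demo_log_mel = np.random.default_rng(0).normal(size=(128, 50))
demo_mfcc = mfcc_from_log_mel(demo_log_mel)
print(demo_mfcc.shape)
```

With the code above this would be `mfcc = mfcc_from_log_mel(logMel)`. When plotting, `librosa.display.specshow(mfcc, x_axis='time')` without `y_axis='mel'` is probably more appropriate, because the rows are coefficient indices rather than frequencies, and the colorbar values are not in dB.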
The plotted MFCC isn't the same as Librosa's MFCC plot. How should I apply the DCT to the Mel spectrogram? Here is a comparison of the two MFCC plots: