How to resample from 8K to 16K with librosa or torchaudio the same way ffmpeg does?


In my app:

  • I get an array of audio samples (sample rate 8000) that was loaded with torchaudio.load.

  • I need to run whisper (STT) on this audio array.

  • For efficiency, I want to avoid loading the wav file a second time with whisper.load_audio, and instead resample the existing array to 16000.

  • whisper.load_audio uses ffmpeg to load the audio and resample it to 16000. I'm trying to resample the array with librosa or torchaudio instead, but the results never match ffmpeg's output.

  • (I assume that if I use a resampling method different from the one the whisper model was trained with, I may get bad results.)

Example: loading test.wav (SR=8000) and printing the first 5 values:

whisper_audio = whisper.load_audio(file) => [-0.00082397 -0.00115967 -0.00186157 -0.00231934 -0.00222778, ...]

Loading with torchaudio and resampling with librosa:

librosa.resample(vad_audio, orig_sr=8000, target_sr=16000, scale=True, res_type='kaiser_best') => [-0.00082317 -0.0010577 -0.0013937 -0.0016688 -0.00186235, ...]

The values are different.

How can I resample the audio exactly the way ffmpeg does?


There is 1 best solution below


You can use torchaudio.io.StreamReader to load and resample audio. It is implemented on top of FFmpeg, so it may produce the same waveform.

When you use the add_basic_audio_stream method with the sample_rate option, the resampling is done by FFmpeg's filter functionality.

https://pytorch.org/audio/2.1.1/generated/torchaudio.io.StreamReader.html#add-basic-audio-stream
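A minimal sketch of this, assuming torchaudio 2.1 with the FFmpeg libraries available. load_resampled and max_abs_diff are hypothetical helper names, and downmixing to mono with num_channels=1 is an assumption on my side. Note that StreamReader decodes from a file path or file-like object, not from an array that is already in memory. The second helper only compares two sample sequences, so you can measure how close the result is to whisper.load_audio rather than assume the two are identical.

```python
def load_resampled(path, sample_rate=16000):
    """Decode an audio file and resample it through FFmpeg's filter graph."""
    # Imported here so that max_abs_diff below works without torchaudio.
    from torchaudio.io import StreamReader

    reader = StreamReader(path)
    # -1 / -1: disable chunking and let the buffer grow, so the whole
    # file comes back as a single chunk.
    reader.add_basic_audio_stream(
        frames_per_chunk=-1,
        num_chunks=-1,
        sample_rate=sample_rate,
        num_channels=1,  # assumption: mono output
    )
    reader.process_all_packets()
    (chunk,) = reader.pop_chunks()
    return chunk[:, 0]  # (frames, channels) -> 1-D float32 tensor


def max_abs_diff(a, b):
    """Largest absolute difference over the overlapping samples of a and b."""
    return max(abs(float(x) - float(y)) for x, y in zip(a, b))


# Example comparison (requires openai-whisper and a local test.wav):
# import whisper
# ours = load_resampled("test.wav")
# ref = whisper.load_audio("test.wav")
# print(max_abs_diff(ours[:1000], ref[:1000]))
```

If the difference is not close to zero, the ffmpeg command and the basic stream are probably not using the same conversion settings.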

If the ffmpeg command uses a non-default resampling method, you need to construct the same filter description and pass it to the add_audio_stream method.

https://pytorch.org/audio/2.1.1/generated/torchaudio.io.StreamReader.html#add-audio-stream
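A rough sketch of the custom-filter route, under some assumptions. As far as I know, whisper.load_audio runs ffmpeg with -ac 1 -ar 16000 and 16-bit PCM output, then divides by 32768.0, so the filter below asks for mono s16 at 16000. build_filter_desc and load_with_filter are hypothetical helper names, and I have not verified that this reproduces the ffmpeg CLI output bit for bit. The resampler option (e.g. soxr) only matters if your ffmpeg command sets it, and it needs an FFmpeg build that supports it.

```python
def build_filter_desc(sample_rate, sample_fmt="s16",
                      channel_layout="mono", resampler=None):
    """Build an FFmpeg audio filter description for resampling."""
    resample = f"aresample={sample_rate}"
    if resampler is not None:
        # libswresample option passed through the aresample filter
        resample += f":resampler={resampler}"
    fmt = f"aformat=sample_fmts={sample_fmt}:channel_layouts={channel_layout}"
    return f"{resample},{fmt}"


def load_with_filter(path, filter_desc):
    """Decode a whole file through a custom FFmpeg filter graph."""
    # Imported here so build_filter_desc is usable without torchaudio.
    from torchaudio.io import StreamReader

    reader = StreamReader(path)
    reader.add_audio_stream(
        frames_per_chunk=-1,  # no chunking
        num_chunks=-1,        # let the buffer grow to hold the whole file
        filter_desc=filter_desc,
    )
    reader.process_all_packets()
    (chunk,) = reader.pop_chunks()
    return chunk  # shape (frames, channels); int16 when sample_fmt is s16


# Usage, mimicking the scaling that whisper applies to 16-bit PCM:
# desc = build_filter_desc(16000)
# pcm = load_with_filter("test.wav", desc)
# audio = pcm[:, 0].float() / 32768.0
```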