I'm working on a project that involves transcribing audio files using OpenAI's Whisper. To improve the quality of the transcriptions, I'm trying to reduce the noise in my audio files using the reduce_noise function from the noisereduce Python library before passing them into Whisper.
I've noticed an issue where the transcriptions from the noise-reduced audio files have an offset in timing compared to the transcriptions from the original audio files. The offset is not constant throughout the audio file.
Here's a rough outline of my process:
- Apply noise reduction to the audio using noisereduce.
- Transcribe both the original and noise-reduced audio using Whisper.
- Choose whichever transcription has the longer text in whisper_result['text'].
- Use the segments key from the Whisper result for timing: each segment carries start and end times, which I use to build subtitles (a simplified sketch follows this list).
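For reference, this is roughly how the segments become an SRT file (simplified; the srt_time helper is illustrative, not part of Whisper):

# Simplified sketch of the subtitle step: turn Whisper's segments list into
# an SRT file (the srt_time helper is illustrative, not part of Whisper)
def srt_time(seconds):
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("subtitles.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(segments, start=1):
        f.write(f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
        f.write(seg['text'].strip() + "\n\n")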
When I use the segments from the transcription of the original audio, the timing aligns correctly. However, when I use the segments from the transcription of the noise-reduced audio, the timing is offset. Since the subtitles are played over the original audio, the timings have to match the original file, not the de-noised one.
Here's an example of my code:
# Assume 'orig_audio_data' is my original audio, loaded as a float32 NumPy
# array at sampling rate 'rate' (16 kHz mono, since Whisper assumes 16 kHz
# when given a raw array)
import noisereduce
import whisper

# Apply noise reduction; cast back to float32, which Whisper expects
clean_audio = noisereduce.reduce_noise(y=orig_audio_data, sr=rate).astype("float32")

# Transcribe both the original and the noise-reduced audio with the same model
model_medium = whisper.load_model("medium")
original_transcription = model_medium.transcribe(orig_audio_data)
clean_transcription = model_medium.transcribe(clean_audio)

# Choose the longer transcription
if len(original_transcription['text']) > len(clean_transcription['text']):
    chosen_transcription = original_transcription
else:
    chosen_transcription = clean_transcription

# Use 'segments' for timing
segments = chosen_transcription['segments']
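As a sanity check, it may be worth confirming that reduce_noise returns the same number of samples it was given, since a length change would itself shift every timestamp:

# Illustrative sanity check: a change in sample count would itself shift timings
assert len(clean_audio) == len(orig_audio_data), "denoiser changed the audio length"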
I tried the following, but the outcome was the same:
- Other noise-reduction libraries.
- Renormalizing the audio after the noise reduction step.
- Correcting the timing after the fact; because the offset changes over the course of the audio, this is very difficult.
- Saving the cleaned audio to disk and loading it back (exporting and loading with soundfile and librosa).
- Splitting the audio into chunks, applying reduce_noise to each chunk, and concatenating the results; Whisper's output on the result is much less accurate.
Has anyone else encountered this issue or have suggestions on what might be causing this and how to fix it?
You should use a VAD (Voice Activity Detector). Whisper was trained on real-world audio collected from many different environments (see the paper), so noise is itself a feature that helps it predict words. However, it is a generative model: if you send it a chunk that contains no speech, it will still generate a transcription for that chunk, which creates inaccuracy. Gating the input with a VAD so Whisper only sees speech avoids this, and feeding it audio at the expected sample rate (16 kHz) can also improve accuracy.
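For example, here is a minimal sketch using Silero VAD (my illustrative choice; any VAD works) that transcribes only the detected speech regions and shifts each segment's timestamps back onto the original file's timeline:

# Minimal sketch: gate Whisper with Silero VAD (illustrative choice of VAD).
# Only detected speech regions are transcribed, and each segment's timestamps
# are shifted back into the timeline of the full, original file.
import torch
import whisper

SR = 16000  # both Silero VAD and Whisper work at 16 kHz mono

vad_model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, _, read_audio, _, _ = utils

audio = read_audio('input.wav', sampling_rate=SR)             # 1-D float32 tensor
speech_regions = get_speech_timestamps(audio, vad_model, sampling_rate=SR)

model = whisper.load_model('medium')
segments = []
for region in speech_regions:                                 # 'start'/'end' are sample indices
    chunk = audio[region['start']:region['end']].numpy()
    offset = region['start'] / SR                             # seconds into the full file
    result = model.transcribe(chunk)
    for seg in result['segments']:
        seg['start'] += offset                                # map back to full-file time
        seg['end'] += offset
        segments.append(seg)

Because the region boundaries come from the VAD run on the untouched audio, the shifted timestamps stay aligned with the original file no matter what preprocessing is applied inside each chunk.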