I'm working on a project that involves transcribing audio files using OpenAI's Whisper. To improve the quality of the transcriptions, I'm trying to reduce the noise in my audio files using the reduce_noise function from the noisereduce Python library before passing them into Whisper.
I've noticed an issue where the transcriptions from the noise-reduced audio files have an offset in timing compared to the transcriptions from the original audio files. The offset is not constant throughout the audio file.
Here's a rough outline of my process:
- Apply noise reduction to the audio using noisereduce.
- Transcribe both the original and noise-reduced audio using Whisper.
- Choose whichever transcription has the longer text in whisper_result['text'].
- Use the segments key from the Whisper result for timing: each segment carries start and end times, which I use to build subtitles (a simplified sketch follows this list).
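For reference, this is roughly how the segments become an SRT file (simplified; the srt_time helper is illustrative, not part of Whisper):

# Simplified sketch of the subtitle step: turn Whisper's segments list into
# an SRT file (the srt_time helper is illustrative, not part of Whisper)
def srt_time(seconds):
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("subtitles.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(segments, start=1):
        f.write(f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
        f.write(seg['text'].strip() + "\n\n")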
When I use the segments from the transcription of the original audio, the timing aligns correctly. However, when I use the segments from the transcription of the noise-reduced audio, the timing is offset. Since the subtitles are played over the original audio, the timings have to match the original file, not the de-noised one.
Here's an example of my code:
# Assume 'orig_audio_data' is my original audio, loaded as a float32 NumPy
# array at sampling rate 'rate' (16 kHz mono, since Whisper assumes 16 kHz
# when given a raw array)
import noisereduce
import whisper

# Apply noise reduction; cast back to float32, which Whisper expects
clean_audio = noisereduce.reduce_noise(y=orig_audio_data, sr=rate).astype("float32")

# Transcribe both the original and the noise-reduced audio with the same model
model_medium = whisper.load_model("medium")
original_transcription = model_medium.transcribe(orig_audio_data)
clean_transcription = model_medium.transcribe(clean_audio)

# Choose the longer transcription
if len(original_transcription['text']) > len(clean_transcription['text']):
    chosen_transcription = original_transcription
else:
    chosen_transcription = clean_transcription

# Use 'segments' for timing
segments = chosen_transcription['segments']
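As a sanity check, it may be worth confirming that reduce_noise returns the same number of samples it was given, since a length change would itself shift every timestamp:

# Illustrative sanity check: a change in sample count would itself shift timings
assert len(clean_audio) == len(orig_audio_data), "denoiser changed the audio length"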
I tried the following, but the outcome was the same:
- Other noise-reduction libraries.
- Renormalizing the audio after the noise reduction step.
- Correcting the timing after the fact; because the offset changes over the course of the audio, this is very difficult.
- Saving the cleaned audio to disk and loading it back (exporting and loading with soundfile and librosa).
- Splitting the audio into chunks, applying reduce_noise to each chunk, and concatenating the results; Whisper's output on the result is much less accurate.
Has anyone else encountered this issue or have suggestions on what might be causing this and how to fix it?
You should use a VAD (Voice Activity Detector). Whisper was trained on real-world audio collected from many different environments (see the paper), so noise is itself a feature that helps it predict words. However, it is a generative model: if you send it a chunk that contains no speech, it will still generate a transcription for that chunk, which creates inaccuracy. Gating the input with a VAD so Whisper only sees speech avoids this, and feeding it audio at the expected sample rate (16 kHz) can also improve accuracy.
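For example, here is a minimal sketch using Silero VAD (my illustrative choice; any VAD works) that transcribes only the detected speech regions and shifts each segment's timestamps back onto the original file's timeline:

# Minimal sketch: gate Whisper with Silero VAD (illustrative choice of VAD).
# Only detected speech regions are transcribed, and each segment's timestamps
# are shifted back into the timeline of the full, original file.
import torch
import whisper

SR = 16000  # both Silero VAD and Whisper work at 16 kHz mono

vad_model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, _, read_audio, _, _ = utils

audio = read_audio('input.wav', sampling_rate=SR)             # 1-D float32 tensor
speech_regions = get_speech_timestamps(audio, vad_model, sampling_rate=SR)

model = whisper.load_model('medium')
segments = []
for region in speech_regions:                                 # 'start'/'end' are sample indices
    chunk = audio[region['start']:region['end']].numpy()
    offset = region['start'] / SR                             # seconds into the full file
    result = model.transcribe(chunk)
    for seg in result['segments']:
        seg['start'] += offset                                # map back to full-file time
        seg['end'] += offset
        segments.append(seg)

Because the region boundaries come from the VAD run on the untouched audio, the shifted timestamps stay aligned with the original file no matter what preprocessing is applied inside each chunk.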