I have an example of an audio ad and a recording of a radio stream. I want to find out how many times this specific ad repeats in the stream. I tried audio fingerprinting with the librosa library, but my command seems to run forever; I've already waited 30 minutes and it is still running. Can you please tell me what is wrong with my code? How can I optimize it?
import numpy as np
import librosa

def count_repeats(ref_file, longer_file):
    # Load the reference and longer audio signals
    ref_audio, ref_sr = librosa.load(ref_file, sr=None, mono=True)
    longer_audio, longer_sr = librosa.load(longer_file, sr=None, mono=True)
    # Compute the cross-correlation between the reference and longer audio signals
    corr = np.correlate(ref_audio, longer_audio, mode='full')
    # Find the time lag that maximizes the cross-correlation
    lag = np.argmax(corr) - len(ref_audio) + 1
    # Compute the duration of the reference and longer audio signals
    ref_duration = len(ref_audio) / ref_sr
    longer_duration = len(longer_audio) / longer_sr
    # Compute the number of repeats based on the time lag and the duration of the audio signals
    repeats = int(np.floor((longer_duration - lag) / ref_duration))
    return repeats

ref_file = 'house_ad.mp3'
longer_file = 'long_audio.mp3'
repeats = count_repeats(ref_file, longer_file)
print(f'Number of repeats: {repeats}')
The reason your function is so slow is probably that cross-correlation is an O(n*m) operation, and at audio sample rates n and m are very large. With a 60-second clip and a 60-minute recording at a 22,050 Hz sample rate, that is about 1.3 million × 79 million, roughly 10^14 operations.
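As an aside: if you did want to stick with raw cross-correlation, scipy can compute it via the FFT in O(n log n) time, which alone brings the runtime down to something manageable, although it does not fix the robustness issue below.

from scipy import signal
# FFT-based cross-correlation instead of numpy's direct O(n*m) implementation
corr = signal.correlate(longer_audio, ref_audio, mode='valid', method='fft')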
And while cross-correlation is conceptually a valid approach for comparing a signal against a template, it is quite fragile: even small changes in the audio will throw it off.
A much more robust and more computationally efficient approach is to do the matching in another feature space. For audio, a good starting point is MFCC (Mel-frequency cepstral coefficients).
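For illustration, with librosa that looks like this (20 coefficients and the default hop length of 512 samples are just reasonable starting values, not requirements):

import librosa
y, sr = librosa.load('house_ad.mp3', sr=22050, mono=True)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=512)
print(mfcc.shape)  # (20, n_frames), roughly 43 feature frames per second of audio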
A basic but very effective method of signal matching is diagonal matching, described in the DiagonalMatching notebook accompanying the book Fundamentals of Music Processing Using Python and Jupyter Notebooks (FMP) by Meinard Müller.
Code following this method is shown below. On your example, it successfully identifies the provided clip at the 3 correct locations.
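A minimal sketch of this kind of diagonal matching, using MFCC features and a cosine-distance cost, could look like the following (the helper names, the 22,050 Hz sample rate, the 20 MFCC coefficients and the 0.2 threshold are my own illustrative choices, not values prescribed by FMP):

import numpy as np
import librosa
import scipy.signal

def extract_features(path, sr=22050, n_mfcc=20, hop_length=512):
    # Load the audio at a common sample rate and summarize it as MFCC frames
    y, _ = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
    # Normalize each frame to unit length so a dot product gives cosine similarity
    mfcc = mfcc / (np.linalg.norm(mfcc, axis=0, keepdims=True) + 1e-9)
    return mfcc

def diagonal_matching(query, database):
    # query: (n_features, N), database: (n_features, M), with M >= N
    N, M = query.shape[1], database.shape[1]
    # Cost matrix: 1 - cosine similarity between every query frame and every database frame
    C = 1.0 - query.T @ database  # shape (N, M)
    # Matching function: average cost along the diagonal starting at each offset m
    idx = np.arange(N)
    delta = np.empty(M - N + 1)
    for m in range(M - N + 1):
        delta[m] = C[idx, m + idx].mean()
    return delta

def find_matches(delta, threshold=0.2, min_distance=1):
    # Local minima of the matching function below the threshold are match candidates
    minima, _ = scipy.signal.find_peaks(-delta, height=-threshold, distance=min_distance)
    return minima

The Python loop over offsets is not a problem here: the matching runs on roughly 43 feature frames per second instead of 22,050 raw samples per second, which is what makes this approach so much cheaper than sample-level cross-correlation.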
NOTE: This approach assumes that the audio clip being queried for sounds very similar each time it appears. If there is considerable variation in the clip, or if it occurs together with other sounds etc., more advanced approaches will be needed.
The complete notebook can also be found here.
Here is how to use it
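(The file names are taken from your question; the threshold and the minimum spacing between matches are assumptions that may need tuning for your material.)

sr = 22050
hop_length = 512

query = extract_features('house_ad.mp3', sr=sr, hop_length=hop_length)
database = extract_features('long_audio.mp3', sr=sr, hop_length=hop_length)

delta = diagonal_matching(query, database)
# Require detected matches to be at least one clip length apart
matches = find_matches(delta, threshold=0.2, min_distance=query.shape[1])

times = matches * hop_length / sr  # frame offsets -> seconds
print(f'Number of repeats: {len(matches)}')
for t in times:
    print(f'Match at {t:.1f} s')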
It should print the number of repeats and the time offset of each match, and plotting the matching function makes it easy to verify that the three occurrences line up with clear minima.
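A sketch for producing such a plot with matplotlib, reusing delta, matches, times, hop_length and sr from the usage example above:

import matplotlib.pyplot as plt

t_axis = np.arange(len(delta)) * hop_length / sr
plt.figure(figsize=(12, 3))
plt.plot(t_axis, delta, label='matching cost')
plt.plot(times, delta[matches], 'rx', label='detected matches')
plt.xlabel('Time in stream (s)')
plt.ylabel('Average cost')
plt.legend()
plt.tight_layout()
plt.show()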