I have been pulling my hair out over a problem, and none of the answers I have found on Stack Overflow so far have helped, so I am asking for your help.
The overall problem
I would like to create a function that finds the exact timestamp where an audio excerpt starts in a larger audio file. For test purposes, I used a 5-minute audio file and a 43-second excerpt of it. I aligned the two audio files in Audacity: the excerpt starts exactly at 00:01:55.554920.
I would also like the function to return a value if, and only if, its confidence is above a certain threshold, which would be a parameter of the function. I intend to do this by checking whether the correlation coefficient between the two aligned signals is above the given threshold.
In other words, here is a simplified version of the code:
def find_excerpt_starting_sample(original_audio, excerpt, threshold):
    # Find the cross-correlation coefficients for each lag
    xcorr = cross_correlation(original_audio, excerpt)
    # Return the lag of the max correlation if it is over threshold
    if np.max(xcorr) > threshold:
        return np.argmax(xcorr)
    else:
        raise Exception("No correlation over threshold found.")
I have been having a lot of trouble finding the right cross_correlation function, because none of my attempts returned an array whose values lie between -1 and 1.
The problem, simplified
As my attempts on the audio files have been inconclusive, I have tried to perform the same thing on two numerical arrays:
y1 = [2, 22, 14, 8, 0, 4, 8, 16, 26, 6, 12, 14, 16, 2, 6]
y2 = [4, 8, 16, 26, 6, 12]
Here, y2 contains a subset of y1 (starting at index 5). In order to make sure that the function works independently of the amplitude scale, I halved all of the values of y2:
y1 = [2, 22, 14, 8, 0, 4, 8, 16, 26, 6, 12, 14, 16, 2, 6]
y2 = [2, 4, 8, 13, 3, 6]
I would like to create a cross-correlation function that returns an array where the value at lag 5 is 1.
My attempts so far
np.corrcoef
If we just do a simple correlation and slide the excerpt along the original audio, it works:
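import numpy as np
import matplotlib.pyplot as plt

# Pearson correlation between the excerpt and each window of y1.
# Note: this range stops one window short of the end of y1;
# len(y1) - len(y2) + 1 would include the last window as well.
corr = np.zeros(len(y1) - len(y2))
for i in range(len(y1) - len(y2)):
    corr[i] = np.corrcoef(y1[i:i+len(y2)], y2)[0][1]
print(corr)
plt.plot(corr)
plt.show()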
The output is:
[ 0.18961375 -0.71250433 -0.56075283 -0.08468414 0.21913077 1. -0.04179451 -0.46803451 -0.24815461]
The problem is that this technique is far too slow for longer files, since it recomputes a full correlation coefficient at every lag in a Python loop.
scipy.signal.correlate
Now, instead of reinventing the wheel, I started using one of the main solutions found on Stack Overflow, i.e. the correlate function of scipy.signal. Its output peaks at the proper lag. However, as is, it returns raw, unnormalized sums of products (it is essentially a convolution), so the values cannot be read as correlation coefficients.
from scipy import signal
xcorr = signal.correlate(y1, y2, mode="full")
lags = signal.correlation_lags(len(y1), len(y2), mode="full")
print(xcorr)
print(lags)
plt.plot(lags, xcorr)
plt.show()
The output is:
[ 12 138 176 392 390 332 224 232 356 402 596 486 478 414 422 252 186 88 28 12]
[-5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14]
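To turn this output into a position, the peak can be mapped back to a lag by indexing `lags` with the position of the maximum. A minimal sketch on the toy arrays above (the name `best_lag` is introduced here for illustration only):

```python
import numpy as np
from scipy import signal

y1 = [2, 22, 14, 8, 0, 4, 8, 16, 26, 6, 12, 14, 16, 2, 6]
y2 = [2, 4, 8, 13, 3, 6]

xcorr = signal.correlate(y1, y2, mode="full")
lags = signal.correlation_lags(len(y1), len(y2), mode="full")

# The index of the maximum is a position in xcorr, not a lag,
# so it has to be looked up in the lags array.
best_lag = lags[np.argmax(xcorr)]
print(best_lag)     # 5
print(xcorr.max())  # 596, an unnormalized sum of products
```

The lag is correct, but the peak value of 596 depends on the amplitude of both signals, so it cannot be compared against a fixed threshold.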
I saw a few solutions, but they do not work as I intend.
First, a solution here suggests normalizing the coefficient as follows:
corr = signal.correlate(y1 / np.std(y1), y2 / np.std(y2), 'full') / min(len(y1), len(y2))
lags = signal.correlation_lags(len(y1), len(y2), mode="full")
print(corr)
plt.plot(lags, corr)
plt.show()
The output is:
[0.0736392 0.84685082 1.08004163 2.40554727 2.39327407 2.03735126 1.37459844 1.42369124 2.18462966 2.46691327 3.65741371 2.98238769 2.93329489 2.54055247 2.58964528 1.54642325 1.14140763 0.54002082 0.17182481 0.0736392 ]
As you can see, the maximum value is not 1, but 3.65741371.
I then tried another solution found here:
y1n = y1 / np.std(y1)
y2n = y2 / np.std(y2)
xcorr = signal.correlate(y1n, y2n, mode="full")
lags = signal.correlation_lags(len(y1), len(y2), mode="full")
print(xcorr)
plt.plot(lags, xcorr)
plt.show()
The output is:
[ 0.44183521 5.08110495 6.48024979 14.43328362 14.35964442 12.22410756 8.24759064 8.54214745 13.10777798 14.80147963 21.94448224 17.89432612 17.59976931 15.24331484 15.53787165 9.27853947 6.8484458 3.24012489 1.03094883 0.44183521]
Once again, the maximum cross-correlation value is not 1 but 21.94448224.
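One likely reason these values exceed 1: dividing by the standard deviation of the whole signal does not remove the means, and it does not account for the energy of the particular window being compared. A Pearson coefficient subtracts the mean of each segment and divides by the norms of those two segments. A small sketch at the known lag of 5, where the variable names are illustrative:

```python
import numpy as np

y1 = np.array([2, 22, 14, 8, 0, 4, 8, 16, 26, 6, 12, 14, 16, 2, 6], dtype=float)
y2 = np.array([2, 4, 8, 13, 3, 6], dtype=float)

lag = 5
window = y1[lag:lag + len(y2)]

# What signal.correlate reports at this lag: a plain dot product.
raw = np.dot(window, y2)

# Dividing by the global standard deviations only rescales that dot product.
scaled = raw / (np.std(y1) * np.std(y2))  # about 21.94, the peak seen above

# Pearson: center both segments, then divide by the product of their norms.
wc = window - window.mean()
tc = y2 - y2.mean()
pearson = np.dot(wc, tc) / (np.linalg.norm(wc) * np.linalg.norm(tc))

print(raw, scaled, pearson)  # pearson equals 1 up to floating-point rounding
```

The key difference is that the denominator has to be recomputed for every window of y1, which is what a single global division cannot do.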
A cry for help
There is a lot I do not know about correlation. I dug into it, but before going deeper, I am asking whether one of you could point me in the right direction and tell me what I have done wrong so far.
Thanks a lot!






You can use convolution for this:
This function is orders of magnitude faster than the original solution.
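A minimal sketch of what a convolution-based approach might look like, assuming the goal is the Pearson coefficient between the excerpt and every same-length window of the original. The function name `normalized_xcorr`, the `1e-12` tolerance, and the choice to report 0 for flat windows are assumptions made for this illustration, not necessarily what the answer above used. The numerator comes from one FFT-based correlation, and the per-window sums needed for the denominator come from convolving with a vector of ones.

```python
import numpy as np
from scipy import signal


def normalized_xcorr(original, excerpt):
    """Pearson correlation of `excerpt` with each same-length window of `original`."""
    x = np.asarray(original, dtype=float)
    t = np.asarray(excerpt, dtype=float)
    m = len(t)

    # Zero-mean excerpt; its norm is the same for every lag.
    t0 = t - t.mean()
    t_norm = np.sqrt(np.sum(t0 ** 2))

    # Numerator. Because t0 sums to zero, the window mean cancels out,
    # so a plain cross-correlation of x with t0 is enough.
    num = signal.correlate(x, t0, mode="valid", method="fft")

    # Denominator: per-window sum and sum of squares via convolution with ones.
    ones = np.ones(m)
    win_sum = signal.fftconvolve(x, ones, mode="valid")
    win_sq = signal.fftconvolve(x ** 2, ones, mode="valid")
    win_var = np.maximum(win_sq - win_sum ** 2 / m, 0.0)  # clip rounding noise
    denom = np.sqrt(win_var) * t_norm

    # Flat windows have (near) zero variance; report 0 there (arbitrary choice).
    out = np.zeros_like(num)
    np.divide(num, denom, out=out, where=denom > 1e-12)
    return out


y1 = [2, 22, 14, 8, 0, 4, 8, 16, 26, 6, 12, 14, 16, 2, 6]
y2 = [2, 4, 8, 13, 3, 6]

ncc = normalized_xcorr(y1, y2)
best = int(np.argmax(ncc))
print(best)  # 5
```

On the toy arrays, the value at lag 5 is 1 up to floating-point rounding, so `ncc[best]` can be compared directly against a threshold. With `mode="valid"`, index `k` of the result corresponds to the excerpt starting at sample `k` of the original, so no separate lags array is needed.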