I have an audio file and corresponding video (which were recorded synchronously) and I'd like to match up every frame of the video to a corresponding pitch using the praat-parselmouth package.
Calling f0 = sound.to_pitch() usually gives me more pitch samples than I have video frames, so at first I just subsampled the pitches by index with f0 = f0[np.round(np.linspace(0, len(f0)-1, len(video_imgs))).astype(int)].
However, I suspect this loses some (possibly a lot) of the synchronicity, because the samples are picked by index rather than by time.
So I thought I'd use the time_step parameter of to_pitch() in the following function:
import parselmouth

def get_f0_pitch(audio_path, num_samples):
    # Load the audio and request roughly one pitch frame per video frame.
    sound = parselmouth.Sound(audio_path)
    time_step = sound.duration / num_samples  # floating point shenanigans ?
    f0 = sound.to_pitch(time_step=time_step)
    return f0
But for larger num_samples this returns slightly fewer f0 samples than num_samples, which I think is due to the floating point division.
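One possible explanation other than floating point, stated here as an assumption rather than something verified: Praat may only place a pitch frame where a full analysis window fits inside the sound, so the edges of the recording get no frame. The sketch below assumes a window of 3 periods of the pitch floor (0.04 s at the default 75 Hz floor) and a frame count of floor((duration - window) / time_step) + 1. The function name and the 10 s clip length are hypothetical.

```python
import math

def expected_num_frames(duration, time_step, pitch_floor=75.0, periods_per_window=3.0):
    """Estimate the number of pitch frames, ASSUMING a full analysis window
    of periods_per_window / pitch_floor seconds must fit around every frame."""
    window = periods_per_window / pitch_floor  # 0.04 s with the assumed defaults
    if window > duration:
        return 0  # simplification: no frame fits in a sound shorter than the window
    return math.floor((duration - window) / time_step) + 1

duration = 10.0  # hypothetical clip length in seconds
for num_samples in (100, 300):
    # Same time step as in get_f0_pitch: duration / num_samples
    print(num_samples, expected_num_frames(duration, duration / num_samples))
```

If this assumption holds, the count only reaches num_samples while the time step is at least as long as the window, which would match the shortfall appearing only for larger num_samples.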
Should I just do the subsampling since it will still stay fairly synchronous? I couldn't come up with any solutions for the time step issue and I want to avoid resampling the audio if possible.
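A third option that avoids both index-based subsampling and a custom time step is to ask the Pitch object for its value at each video frame's timestamp. This is only a minimal sketch: it assumes the video frames are evenly spaced over the audio duration and that the centre of each frame is the right moment to sample. The helper names video_frame_times and pitch_per_frame are hypothetical; get_value_at_time is a method of parselmouth's Pitch class.

```python
def video_frame_times(duration, num_frames, start=0.0):
    """Centre time (in seconds) of each video frame, ASSUMING evenly
    spaced frames that together span exactly `duration` seconds."""
    dt = duration / num_frames
    return [start + (i + 0.5) * dt for i in range(num_frames)]

def pitch_per_frame(audio_path, num_frames):
    """Return one f0 value (Hz) per video frame, looked up by time."""
    import parselmouth  # imported here so the helper above runs without it

    sound = parselmouth.Sound(audio_path)
    pitch = sound.to_pitch()  # default time step; no need to match the frame count
    times = video_frame_times(sound.duration, num_frames, start=sound.xmin)
    # get_value_at_time interpolates between neighbouring pitch frames;
    # expect NaN where Praat has no pitch estimate (unvoiced or out of range).
    return [pitch.get_value_at_time(t) for t in times]
```

Calling pitch_per_frame(audio_path, len(video_imgs)) would then always return exactly one value per frame, with NaN marking frames that have no pitch estimate.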