Recognize start of piano music in an MP3 file which starts with a spoken introduction, and remove spoken part, using Python


I have a number of .mp3 files which all start with a short voice introduction followed by piano music. I would like to remove the voice part and keep only the piano part, preferably using a Python script. The voice part is of variable length, i.e. I cannot use ffmpeg to remove a fixed number of seconds from the start of each file. Is there a way of detecting the start of the piano part, so that I know how many seconds to remove using ffmpeg or even Python itself? Thank you.

kerasbaz On

This is a non-trivial problem if you want a good outcome.

Quick and dirty solutions would involve inferred parameters like:

  • "there's usually 15 seconds of no or low-dB audio between the speaker and the piano"
  • "there's usually not 15 seconds of no or low-dB audio in the middle of the piano piece"

and then use those parameters to try to get something "good enough" using audio analysis libraries.
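As a rough illustration of that quick and dirty route, here is a minimal sketch that looks for the first long quiet gap and reports where sound resumes. It assumes the MP3 has already been decoded to mono float samples in the range [-1, 1] (for example with a library such as librosa or pydub, not shown here). The function names, the -40 dB threshold, the 50 ms frame size and the default gap length are all assumptions for illustration, not values from your files, so expect to tune them.

```python
import math


def frame_levels_db(samples, frame_len):
    """Return the RMS level of each full frame in dB relative to full scale."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(x * x for x in frame) / frame_len
        # The floor avoids log10(0) on digital silence.
        levels.append(10.0 * math.log10(max(mean_square, 1e-12)))
    return levels


def find_music_start(samples, sample_rate, silence_db=-40.0,
                     min_gap_seconds=2.0, frame_ms=50):
    """Return the time in seconds where sound resumes after the first
    quiet gap of at least min_gap_seconds, or None if there is no such gap.

    All thresholds are illustrative assumptions and need tuning.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_gap_frames = max(1, int(min_gap_seconds * sample_rate) // frame_len)
    levels = frame_levels_db(samples, frame_len)

    heard_sound = False  # ignore any silence before the speaker starts
    quiet_run = 0        # number of consecutive quiet frames seen so far
    for index, level in enumerate(levels):
        if level < silence_db:
            quiet_run += 1
            continue
        if heard_sound and quiet_run >= min_gap_frames:
            # First loud frame after a long enough gap.
            return index * frame_len / sample_rate
        heard_sound = True
        quiet_run = 0
    return None
```

The returned offset could then be handed to ffmpeg's `-ss` option to trim the file. Note that this returns the end of the first long gap it finds, so a long pause in the spoken introduction would trigger it too early.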

I suspect you'll be disappointed with that approach: I can think of many piano pieces with long pauses, and this reads like a classic ML problem.

The best solution here is to use ML with a classification model and a large data set. Here's a walk-through that might help you get started. However, this isn't going to be a few minutes of coding. This is a typical ML task that will involve collecting and tagging lots of data (or having access to pre-tagged data), building an ML pipeline, training a neural net, and so forth.
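Whatever model you end up with, you still need to turn its output into a cut point. The sketch below assumes a hypothetical classifier has already labelled each fixed-length window of the file as "speech" or "music"; the label names, the window length and the `min_run` value are assumptions for illustration. It returns the start of the first sustained run of music windows, so that a single misclassified window does not trigger an early cut.

```python
def first_sustained_music(labels, window_seconds, min_run=5):
    """Return the start time in seconds of the first run of at least
    min_run consecutive "music" windows, or None if there is none.

    labels is a list of per-window predictions from a classifier
    (hypothetical here), in file order.
    """
    run_start = None
    run_length = 0
    for index, label in enumerate(labels):
        if label == "music":
            if run_length == 0:
                run_start = index  # a new candidate run begins here
            run_length += 1
            if run_length >= min_run:
                return run_start * window_seconds
        else:
            run_length = 0  # run broken, start over
    return None
```

With one-second windows and `min_run=5`, an isolated "music" prediction in the middle of the spoken introduction is ignored, and the function only commits once five windows in a row agree.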

Here's another link that may be helpful. He's using a pretrained model to reduce the amount of data required to get started, but you're still going to put in quite a bit of work to get this going.