I'm currently building a Python script that uses Amazon Polly to generate audio. The goal is to take that audio and add captions to make a video.
My problem is that I can't find a way to get a timestamp for every word spoken by the AI voice, which I need in order to time the captions.
Is there some way around it or some other solution?
I found out that Polly accepts a 'SpeechMarkTypes' parameter, with the output format set to 'json', for this situation. This is the code I came up with to generate a .txt file with the format 'time: word'.
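
A minimal sketch of that approach, assuming boto3 is installed and AWS credentials are already configured. The function names, the 'Joanna' voice and the 'timestamps.txt' file name are placeholders picked for illustration. With the output format set to 'json', Polly returns one JSON object per line, and 'time' is the offset in milliseconds from the start of the audio.

```python
import json


def speech_marks_to_lines(raw):
    """Convert Polly's newline-delimited JSON speech marks into 'time: word' lines."""
    lines = []
    for row in raw.decode("utf-8").splitlines():
        if not row.strip():
            continue  # skip blank lines, e.g. after a trailing newline
        mark = json.loads(row)
        if mark.get("type") == "word":
            lines.append(f"{mark['time']}: {mark['value']}")
    return lines


def write_word_timestamps(text, out_path="timestamps.txt", voice_id="Joanna"):
    """Request word speech marks from Polly and save them as 'time: word' lines."""
    # Imported here so the parser above can be used without boto3 installed.
    import boto3

    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Text=text,
        VoiceId=voice_id,          # placeholder voice, swap in your own
        OutputFormat="json",       # speech marks require the json output format
        SpeechMarkTypes=["word"],  # one mark per spoken word
    )
    raw = response["AudioStream"].read()
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(speech_marks_to_lines(raw)) + "\n")
```

Calling write_word_timestamps("Mary had a little lamb") should then produce one line per word. Note that a speech-mark request returns only the JSON metadata, so the audio itself needs a second synthesize_speech call with the same text and voice and an audio output format such as 'mp3'.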