Is there some way to create timestamps on Amazon Polly audio?

I'm currently building a Python script that uses Amazon Polly to generate audio. The goal is to take that audio and add captions to make a video.

My problem is that I can't find a way to generate a timestamp for every word spoken by the voice, which I then need for the captions.

Is there some way around it or some other solution?

There is 1 best solution below

Hugo Novais On

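I found out that Polly accepts a 'SpeechMarkTypes' parameter when the output format is 'json', which covers this situation. Each returned mark carries a 'time' value in milliseconds from the start of the audio. Note that a request with the json output format returns only the speech marks, so the audio itself needs a separate synthesize_speech call with an audio format such as mp3. This is the code I came up with to generate a .txt file with the format 'time: word'.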

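import json
import boto3

# Assumes AWS credentials and a region are already configured.
polly = boto3.client('polly')
text = 'Hello world.'  # placeholder input text
speaker = 'Joanna'     # placeholder en-US voice
# OutputFormat='json' returns speech marks instead of audio.
response = polly.synthesize_speech(
    Engine='standard',
    LanguageCode='en-US',
    OutputFormat='json',
    Text=text,
    VoiceId=speaker,
    SpeechMarkTypes=['word']
)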

# With the json format, the stream holds one JSON object per line, not audio.
marks_data = response['AudioStream'].read().decode('utf-8')
mark_lines = marks_data.splitlines()

speech_marks = []
for line in mark_lines:
    try:
        speech_marks.append(json.loads(line))
    except json.JSONDecodeError as e:
        print(f"JSON decoding error: {e}")

# Keep (time, word) pairs; 'time' is in milliseconds from the start of the audio.
timestamps = []
for mark in speech_marks:
    if mark['type'] == 'word':
        timestamps.append((mark['time'], mark['value']))

with open('timestamps.txt', 'w', encoding='utf-8') as txt_file:
    for time_ms, word in timestamps:
        txt_file.write(f"{time_ms}: {word}\n")
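
Since the end goal is captions, the same (time, word) pairs could be turned into an SRT subtitle file instead of a plain .txt. The following is only a rough sketch under a few assumptions: the helper names format_srt_time and words_to_srt are made up for this example, the sample data is hypothetical, and because Polly word marks only carry a start time, each cue is assumed to end where the next word starts (the last word gets an arbitrary 500 ms).

```python
def format_srt_time(ms):
    """Convert milliseconds to the SRT 'HH:MM:SS,mmm' form."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def words_to_srt(timestamps, last_word_ms=500):
    """Build one SRT cue per word from (time_ms, word) pairs.

    A cue ends at the next word's start time; the final cue is given
    an assumed duration of last_word_ms milliseconds.
    """
    cues = []
    for index, (start, word) in enumerate(timestamps):
        if index + 1 < len(timestamps):
            end = timestamps[index + 1][0]
        else:
            end = start + last_word_ms
        cues.append(
            f"{index + 1}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{word}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)


# Hypothetical sample data in the same shape as the 'timestamps' list above.
sample = [(6, "Hello"), (373, "world")]
srt_text = words_to_srt(sample)
```

Writing srt_text to a file such as captions.srt should give something most video editors can import. One word per cue is usually too choppy for real captions, so grouping several words per cue would probably be the next step.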