Separating User Speech from Chatbot Speech in Real-Time Audio Streams with Twilio and Google Speech-to-Text


I'm currently working on an application in which a chatbot interacts with users through Twilio Media Streams. The primary challenge I'm encountering is accurately differentiating user speech from chatbot speech within the incoming audio stream. The goal is to ensure that certain actions are triggered exclusively during user speech, without interference from the chatbot's own responses.

Currently, I'm using Google Speech-to-Text to transcribe user speech, but transcription quality deteriorates when the chatbot's voice is present in the audio stream alongside the user's voice. I'd like to improve accuracy by either isolating the chatbot's voice from the audio bytes before transcription, or by suppressing certain actions while the chatbot is speaking.

For reference, the transcription call (where incoming_speech is the raw audio bytes):

    incoming_text = await get_text_from_audio(incoming_speech)
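For context, Twilio Media Streams deliver audio over a WebSocket as JSON messages whose "media" events carry base64-encoded 8 kHz mu-law payloads. A minimal decoder for producing the raw bytes passed to the transcription step might look like this (the function name is hypothetical; the JSON shape follows Twilio's media event format):

```python
import base64
import json

def decode_media_message(message: str) -> bytes:
    """Extract raw mu-law audio bytes from a Twilio Media Streams
    'media' event; non-media events yield no audio."""
    event = json.loads(message)
    if event.get("event") != "media":
        return b""
    # Twilio encodes each audio frame as base64 inside media.payload.
    return base64.b64decode(event["media"]["payload"])
```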

Tech Stack:

  • Twilio for voice interactions
  • Google Speech-to-Text for transcription
  • Text-to-speech for chatbot responses

Current Approach:

User speech is sent to Google Speech-to-Text for transcription. However, when the chatbot's voice overlaps with the user's voice in the audio stream, transcription accuracy degrades. The application is built with FastAPI.
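The pipeline described above can be sketched roughly as follows. get_text_from_audio is stubbed here in place of the real Google Speech-to-Text call, and the buffering threshold is an arbitrary placeholder, not a recommended value:

```python
import asyncio

CHUNK_THRESHOLD = 3200  # ~400 ms of 8 kHz mu-law audio; placeholder value

async def get_text_from_audio(audio: bytes) -> str:
    """Stub standing in for the real Google Speech-to-Text call."""
    return f"<{len(audio)} bytes transcribed>"

async def transcribe_stream(chunks) -> list:
    """Buffer incoming audio chunks and transcribe once enough audio
    has accumulated (a simplified, non-streaming sketch)."""
    buffer = b""
    results = []
    async for chunk in chunks:
        buffer += chunk
        if len(buffer) >= CHUNK_THRESHOLD:
            results.append(await get_text_from_audio(buffer))
            buffer = b""
    if buffer:
        # Flush whatever audio remains when the stream ends.
        results.append(await get_text_from_audio(buffer))
    return results
```

In the real application the chunks would come from the Twilio WebSocket handler, and the stub would be replaced by the actual transcription call.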

Specific Challenges:

How can I remove the chatbot's voice from the audio bytes before sending them to Google Speech-to-Text, so that transcription remains accurate? Alternatively, how can I prevent specific actions from executing while the chatbot is speaking? Perhaps I could detect the frequency profile of the Twilio TTS voice and filter it out.
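To illustrate the second idea, one sketch of "prevent actions while the chatbot is speaking" is a simple gate keyed off the length of the TTS clip being played back. The class and method names are hypothetical, and estimating playback end from clip duration is an assumption; Twilio Media Streams also support "mark" events that report when queued outbound audio has finished playing, which could drive the same flag more precisely:

```python
import time

class BotSpeechGate:
    """Suppress user-triggered actions while the chatbot's TTS audio
    is presumed to be playing (a sketch, not a Twilio API)."""

    def __init__(self):
        self._speaking_until = 0.0

    def bot_started(self, duration_s: float) -> None:
        # Estimate when playback ends from the TTS clip length.
        self._speaking_until = time.monotonic() + duration_s

    def allow_user_action(self) -> bool:
        # Only allow actions once the estimated playback window has passed.
        return time.monotonic() >= self._speaking_until
```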
