I am working on an application that involves transcribing audio inputs to text, translating the text into Spanish or French, and then synthesizing speech from the translated text. The application is built using JavaScript for the frontend and Python for backend processing.
I'm using Google's Speech-to-Text API for transcription and Google's Text-to-Speech API for speech synthesis.
Here's an overview of the application flow:
- Transcribe audio inputs from the microphone into text using Google Speech-to-Text API.
- Translate the transcribed text into Spanish or French based on user selection.
- Synthesize speech from the translated text in the chosen language.
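The three steps above can be sketched as a single async function that carries the user's language choice through both translation and synthesis. This is only an illustration: `runPipeline` and the three injected step functions are hypothetical names, not part of the actual application or of any Google client library.

```javascript
// Hypothetical sketch: the step functions are injected, so nothing here
// assumes how the real backend or the Google APIs are called.
async function runPipeline(steps, audioBlob, targetLanguage) {
  // 1. Audio -> source-language text (e.g. via Speech-to-Text).
  const transcript = await steps.transcribe(audioBlob);
  // 2. Source text -> target-language text.
  const translated = await steps.translate(transcript, targetLanguage);
  // 3. Target text -> audio, using the SAME target language so the
  //    voice matches the language of the text it is reading.
  const speech = await steps.synthesize(translated, targetLanguage);
  return { transcript, translated, speech };
}
```

The point of the sketch is that the same `targetLanguage` value reaches both the translation and the synthesis step; if the synthesis step falls back to an English default, the translated text would be read by an English voice.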
The issue I'm facing is that the synthesized speech, while accurate in translation, sounds like an English speaker speaking Spanish or French, lacking the native intonation and pronunciation.
Current JavaScript Implementation (Focused on Language Selection and Speech Synthesis):
Below are key portions of my JavaScript code that deal with language selection and interfacing with the Google Text-to-Speech API. This includes capturing the user's language choice, sending a request for speech synthesis in the selected language, and handling the synthesized speech output.
// Snippet for capturing language selection
const selectedLanguage = document.getElementById('language-dropdown').value;

// Snippet for sending a request to the Google Text-to-Speech API
// (through the local backend endpoint)
function playSynthesizedSpeech(text, language) {
    if (isSynthesizingSpeech) {
        console.log("A speech synthesis request is already in progress.");
        return;
    }
    isSynthesizingSpeech = true;
    fetch('http://127.0.0.1:5000/synthesize_speech', {
        method: 'POST',
        headers: {'Content-Type': 'application/json'},
        body: JSON.stringify({text: text, language: language}),
    }).then(response => {
        // fetch() does not reject on HTTP error statuses, so check explicitly
        if (!response.ok) {
            throw new Error(`Synthesis request failed with status ${response.status}`);
        }
        return response.blob();
    }).then(blob => {
        enqueueAudio(blob); // Enqueueing the audio for playback
    }).catch(error => {
        console.error('Error:', error);
        isSynthesizingSpeech = false; // Reset flag on error
    });
}
Note: This code demonstrates how the application selects the language for speech synthesis and processes the audio output. The full application includes additional functionality for audio recording, transcription, and translation.
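One thing that may be worth ruling out is whether the synthesis request actually reaches the API with a Spanish or French voice: if the language code or voice stays at an English default, translated text is read with English pronunciation. Below is a minimal sketch of building a request body in the shape used by the Text-to-Speech REST method `text:synthesize`. The `VOICE_BY_LANGUAGE` table, the `'es'`/`'fr'` keys, and the `buildSynthesisRequest` helper are assumptions for illustration; the voice names are examples and should be checked against the API's current voice list.

```javascript
// Assumed mapping from the dropdown value to a BCP-47 locale and voice name.
// The keys and the chosen voices are illustrative, not taken from the app.
const VOICE_BY_LANGUAGE = {
  es: { languageCode: 'es-ES', name: 'es-ES-Standard-A' },
  fr: { languageCode: 'fr-FR', name: 'fr-FR-Standard-A' },
};

function buildSynthesisRequest(text, language) {
  const voice = VOICE_BY_LANGUAGE[language];
  if (!voice) {
    // Fail loudly instead of silently falling back to an English voice.
    throw new Error(`Unsupported language: ${language}`);
  }
  // Same field layout as the JSON body of text:synthesize.
  return {
    input: { text },
    voice: { languageCode: voice.languageCode, name: voice.name },
    audioConfig: { audioEncoding: 'MP3' },
  };
}
```

For example, `buildSynthesisRequest('Hola, ¿cómo estás?', 'es')` yields a body whose `voice.languageCode` is `'es-ES'`. Whether the `/synthesize_speech` endpoint forwards such fields to Google is not shown in this post, so logging the request the backend actually sends would confirm it.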
I'm particularly interested in any advice on improving the handling and customization of speech synthesis requests to achieve more native-sounding speech outputs in the selected languages.
The frontend design is shown in the attached screenshot.
I would really appreciate any help with this. Thanks in advance.
I've explored several ways to improve the naturalness of the speech output in Spanish and French:
- I experimented with different voice options and settings within the Google Text-to-Speech API, such as adjusting the speaking rate and pitch. These tweaks slightly varied the speech's characteristics, but they did not bring the output noticeably closer to a native sound.
- I attempted to select voices that are supposedly native to the target languages, but the output still carries a prominent English accent in both pronunciation and intonation.
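Since choosing supposedly native voices did not change the result, it may help to confirm which voice a request can resolve to for a given locale. The sketch below filters an array in the shape returned by the Text-to-Speech `voices.list` method (objects with `name` and `languageCodes`). The `pickVoice` helper and its tier preference order are assumptions for illustration, not an official selection rule.

```javascript
// `voices` is assumed to be the `voices` array of a voices.list response,
// e.g. [{ name: 'es-ES-Standard-A', languageCodes: ['es-ES'], ... }, ...].
function pickVoice(voices, languageCode,
                   preferredTiers = ['Neural2', 'Wavenet', 'Standard']) {
  // Keep only voices that declare support for the requested locale.
  const candidates = voices.filter(v => v.languageCodes.includes(languageCode));
  // Prefer a voice whose name contains an earlier tier in the list.
  for (const tier of preferredTiers) {
    const match = candidates.find(v => v.name.includes(tier));
    if (match) return match;
  }
  // Fall back to any voice for the locale, or null if there is none.
  return candidates[0] ?? null;
}
```

If `pickVoice` returns `null` for `'es-ES'` or `'fr-FR'`, or if the name the backend sends does not match the returned voice, the English-sounding output would more likely come from the request parameters than from the voices themselves.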