I'm trying to send an OpenAI text-to-speech stream (https://platform.openai.com/docs/guides/text-to-speech/streaming-real-time-audio) to a Twilio websocket, which accepts mulaw audio at 8 kHz.
If I wait for the entire WAV buffer to stream from OpenAI and then send it all at once to the Twilio websocket, the audio sounds fine, but I want to send chunks as soon as they are available to reduce latency. Here's the code for sending the entire buffer:
function stream2buffer(stream) {
  // Collect every chunk from the stream and resolve with one Buffer.
  return new Promise((resolve, reject) => {
    const _buf = [];
    stream.on("data", (chunk) => _buf.push(chunk));
    stream.on("end", () => resolve(Buffer.concat(_buf)));
    stream.on("error", (err) => reject(err));
  });
}
async function speakAll(text) {
  const response = await openai.audio.speech.create({
    model: "tts-1",
    voice: "alloy",
    input: text,
    response_format: "wav",
  });
  return await stream2buffer(response.body);
}
...
import { WaveFile } from 'wavefile';

const openAIAudio = await speakAll(response);
const wav = new WaveFile();
wav.fromBuffer(openAIAudio);
wav.toSampleRate(8000); // downsample from OpenAI's 24 kHz to Twilio's 8 kHz
wav.toMuLaw();          // 16-bit PCM -> 8-bit mulaw
const mulaw = Buffer.from(wav.data.samples);
const payload = mulaw.toString("base64");
...
this.ws.send(
  JSON.stringify({
    event: "media",
    streamSid: this.streamSid,
    media: {
      payload,
    },
  })
);
this.ws.send(
  JSON.stringify({
    event: "mark",
    streamSid: this.streamSid,
    mark: {
      name: "response",
    },
  })
);
However, if I try to convert the WAV chunks to mulaw as they arrive and send them, I get a ton of static, with the original audio barely discernible. Here's the code I'm using:
import { WaveFile } from 'wavefile';
import { encodeWav } from "wav-converter";

const response = await openai.audio.speech.create({
  model: "tts-1",
  voice: "alloy",
  input: text,
  response_format: "wav",
});
response.body.on("data", (chunk) => {
  // add WAV headers to the chunk, or else WaveFile will throw an error
  const wavFile = encodeWav(chunk, {
    numChannels: 1,
    sampleRate: 24000,
    byteRate: 16,
  });
  const wav = new WaveFile(wavFile);
  wav.toSampleRate(8000);
  wav.toMuLaw();
  const mulaw = Buffer.from(wav.data.samples);
  const payload = mulaw.toString("base64");
  try {
    this.ws.send(
      JSON.stringify({
        event: "media",
        streamSid: this.streamSid,
        media: {
          payload,
        },
      })
    );
    this.ws.send(
      JSON.stringify({
        event: "mark",
        streamSid: this.streamSid,
        mark: {
          name: "response",
        },
      })
    );
  } catch (e) {
    this.L.error("failed to send voice response to ws: " + e);
  }
});
If I concatenate a few WAV chunks together and then convert to mulaw, I get slightly better results, but still a lot of static. I'm wondering if there's something I'm missing with chunk size alignment?
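By alignment I mean something like the following — a rough, untested sketch that assumes the stream begins with a standard 44-byte WAV header and that the PCM samples are 16-bit, so a sample could be split across two chunks:

let leftover = Buffer.alloc(0);
let headerSkipped = false;

response.body.on("data", (chunk) => {
  let data = Buffer.concat([leftover, chunk]);

  // Assumption: the stream starts with a standard 44-byte RIFF/WAV header;
  // drop it so it isn't converted as if it were PCM audio.
  if (!headerSkipped) {
    if (data.length < 44) {
      leftover = data;
      return;
    }
    data = data.subarray(44);
    headerSkipped = true;
  }

  // Keep an even number of bytes so a 16-bit sample is never split
  // across two sends; carry any odd byte over into the next chunk.
  const usable = data.length - (data.length % 2);
  leftover = Buffer.from(data.subarray(usable));
  data = data.subarray(0, usable);
  if (data.length === 0) return;

  // ...then wrap `data` with encodeWav and convert to mulaw as above.
});

Is carrying over the odd byte like this the right direction, or is the static coming from somewhere else?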