Send OpenAI Text-to-Speech WAV stream to a Twilio stream

I'm trying to send an OpenAI text-to-speech stream (https://platform.openai.com/docs/guides/text-to-speech/streaming-real-time-audio) to a Twilio websocket, which accepts mulaw audio at 8 kHz.

If I wait for the entire WAV buffer to arrive from OpenAI and then send it all at once to the Twilio websocket, the audio sounds fine, but I want to send chunks as soon as they are available to reduce latency. Here's the code for sending the entire buffer:

// Collect a readable stream's chunks into a single Buffer
function stream2buffer(stream) {
  return new Promise((resolve, reject) => {
    const _buf = [];

    stream.on("data", (chunk) => _buf.push(chunk));
    stream.on("end", () => resolve(Buffer.concat(_buf)));
    stream.on("error", (err) => reject(err));
  });
}

async function speakAll(text) {
  const response = await openai.audio.speech.create({
    model: "tts-1",
    voice: "alloy",
    input: text,
    response_format: "wav",
  });

  return await stream2buffer(response.body);
}
...
import { WaveFile } from 'wavefile';

const openAIAudio = await speakAll(response);

// Parse the complete WAV, downsample from 24 kHz to 8 kHz, then mulaw-encode
const wav = new WaveFile();
wav.fromBuffer(openAIAudio);
wav.toSampleRate(8000);
wav.toMuLaw();

// Twilio expects the raw mulaw bytes as a base64 string
const mulaw = Buffer.from(wav.data.samples);
const payload = mulaw.toString("base64");

...
// Send the converted audio, then a mark event so Twilio notifies us
// once it has finished playing the buffered audio
this.ws.send(
  JSON.stringify({
    event: "media",
    streamSid: this.streamSid,
    media: {
      payload,
    },
  })
);
this.ws.send(
  JSON.stringify({
    event: "mark",
    streamSid: this.streamSid,
    mark: {
      name: "response",
    },
  })
);
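
Since mulaw encodes each sample independently (one byte per sample, no state carried between samples), I believe an already-converted buffer can safely be split into arbitrarily sized media messages after the fact. A minimal sketch of that, where the 160-byte (20 ms at 8 kHz) frame size is my own arbitrary choice, not something Twilio requires:

// Split the converted mulaw buffer into small frames and send each one
// as its own media event. subarray clamps at the end of the buffer, so
// the final partial frame is handled automatically.
const FRAME_BYTES = 160; // 20 ms of 8 kHz mulaw; arbitrary choice

for (let offset = 0; offset < mulaw.length; offset += FRAME_BYTES) {
  const frame = mulaw.subarray(offset, offset + FRAME_BYTES);
  this.ws.send(
    JSON.stringify({
      event: "media",
      streamSid: this.streamSid,
      media: { payload: frame.toString("base64") },
    })
  );
}

This still waits for the whole OpenAI response before sending anything, so it doesn't address the latency goal; it only suggests the framing itself shouldn't be the problem.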

However, if I try to convert the WAV chunks to mulaw as they arrive and send them, I get a ton of static, with the original audio barely discernible. Here's the code I'm using:

import { WaveFile } from 'wavefile';
import { encodeWav } from "wav-converter";

const response = await openai.audio.speech.create({
  model: "tts-1",
  voice: "alloy",
  input: text,
  response_format: "wav",
});

response.body.on("data", (chunk) => {
  // add WAV headers to the chunk, or else WaveFile will throw an error
  const wavFile = encodeWav(chunk, {
    numChannels: 1,
    sampleRate: 24000,
    byteRate: 16,
  });

  const wav = new WaveFile(wavFile);

  wav.toSampleRate(8000);
  wav.toMuLaw();

  const mulaw = Buffer.from(wav.data.samples);
  const payload = mulaw.toString("base64");

  try {
    this.ws.send(
      JSON.stringify({
        event: "media",
        streamSid: this.streamSid,
        media: {
          payload,
        },
      })
    );

    this.ws.send(
      JSON.stringify({
        event: "mark",
        streamSid: this.streamSid,
        mark: {
          name: "response",
        },
      })
    );
  } catch (e) {
    this.L.error("failed to send voice response to ws: " + e);
  }
});

If I concatenate a few WAV chunks together and then convert to mulaw, I get slightly better results, but still a lot of static. I'm wondering if there's something I'm missing with chunk size alignment?
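
For context, here's a sketch of the alignment handling I've been considering. It assumes the OpenAI WAV response is 16-bit mono PCM at 24 kHz and that only the first chunk carries a standard 44-byte RIFF header; I haven't verified either assumption:

import { WaveFile } from "wavefile";
import { encodeWav } from "wav-converter";

let headerStripped = false;
let leftover = Buffer.alloc(0); // carries an odd trailing byte across chunks

response.body.on("data", (chunk) => {
  // Strip the RIFF header from the first chunk so only raw PCM is wrapped
  // by encodeWav; otherwise the header bytes get decoded as audio samples.
  // (Assumes a standard 44-byte header.)
  if (!headerStripped) {
    chunk = chunk.subarray(44);
    headerStripped = true;
  }

  // Prepend any leftover byte from the previous chunk, then keep an even
  // number of bytes so each piece ends on a 16-bit sample boundary.
  let data = Buffer.concat([leftover, chunk]);
  const usable = data.length - (data.length % 2);
  leftover = data.subarray(usable);
  data = data.subarray(0, usable);
  if (data.length === 0) return;

  const wav = new WaveFile(
    encodeWav(data, { numChannels: 1, sampleRate: 24000, byteRate: 16 })
  );
  wav.toSampleRate(8000);
  wav.toMuLaw();
  // ...send Buffer.from(wav.data.samples).toString("base64") as above
});

Even with the byte alignment fixed, I suspect resampling each chunk independently could still produce discontinuities at the chunk boundaries, since the resampler keeps no state across calls.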
