I want to make a live transcription app with Node.js and the Google Speech-to-Text API.
I am using RecordRTC and socket.io to get audio chunks to the backend server. At the moment I am recording 1 s long chunks, and the transcription works, but the API does not treat them as a stream: it sends back a response after processing each chunk in isolation. This means I get back half sentences, and Google can't use the context to help itself recognize the speech.
My question is: how do I make Google treat my chunks as a continuous stream? Or is there another solution to achieve the same result (transcribing the mic audio live, or very close to live)?
Google has a demo on their website which does exactly what I am looking for, so it should be possible.
My code (mainly from the selfservicekiosk-audio-streaming repo):
ss is socket.io-stream.
Server side
io.on("connect", (socket) => {
        socket.on("create-room", (data, cb) => createRoom(socket, data, cb))
        socket.on("disconnecting", () => exitFromRoom(socket))
        // getting the stream, it gets called every 1s with a blob
        ss(socket).on("stream-speech", async function (stream: any, data: any) {
            const filename = path.basename(data.name)
            const writeStream = fs.createWriteStream(filename)
           
            stream.pipe(writeStream)
            speech.speechStreamToText(
                stream,
                async function (transcribeObj: any) {
                    socket.emit("transcript", transcribeObj.transcript)
                }
            )
        })
})
async speechStreamToText(stream: any, cb: Function) {
        const sttRequest = {
            config: {
                languageCode: "en-US",
                sampleRateHertz: 16000,
                encoding: "WEBM_OPUS",
                enableAutomaticPunctuation: true,
            },
            singleUtterance: false,
        }
        const stt = new speechToText.SpeechClient()
        //setup the stt stream
        const recognizeStream = stt
            .streamingRecognize(sttRequest)
            .on("data", function (data: any) {
                // this gets called every second and I get transcription chunks which usually make close to no sense
                console.log(data.results[0].alternatives)
                cb({ transcript: data.results[0].alternatives[0].transcript })
            })
            .on("error", (e: any) => {
                console.log(e)
            })
            .on("end", () => {
                //this gets called every second. 
                console.log("on end")
            })
        stream.pipe(recognizeStream)
        stream.on("end", function () {
            console.log("socket.io stream ended")
        })
    }
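From what I can tell, every incoming blob spins up a brand-new recognize stream, so each 1 s chunk is transcribed in isolation. What I imagine I need is a single long-lived recognize stream per socket that every chunk gets piped into. A sketch of that idea (untested; the activeStreams map is just my hypothetical bookkeeping, and stt/sttRequest are the client and request from speechStreamToText above):

// Untested sketch: keep ONE recognize stream per socket and feed every
// chunk into it, instead of creating a new recognize stream per blob.
// activeStreams is a hypothetical per-socket registry; stt and sttRequest
// are assumed to be the objects defined in speechStreamToText above.
const activeStreams = new Map<string, any>()

ss(socket).on("stream-speech", (stream: any, data: any) => {
    let recognizeStream = activeStreams.get(socket.id)
    if (!recognizeStream) {
        recognizeStream = stt
            .streamingRecognize(sttRequest)
            .on("data", (d: any) =>
                socket.emit("transcript", d.results?.[0]?.alternatives?.[0]?.transcript)
            )
            .on("error", (e: any) => console.log(e))
        activeStreams.set(socket.id, recognizeStream)
    }
    // end: false keeps the recognize stream open, so the next chunk
    // continues the same session instead of closing it.
    stream.pipe(recognizeStream, { end: false })
})

I don't know whether streamingRecognize will accept audio appended like this when every blob carries its own container header, which is partly what I'm asking.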
Client side
const sendBinaryStream = (blob: Blob) => {
    const stream = ss.createStream()
    ss(socket).emit("stream-speech", stream, {
        name: "_temp/stream.wav",
        size: blob.size,
    })
    ss.createBlobReadStream(blob).pipe(stream)
}
useEffect(() => {
        let recorder: any
        if (activeChat) {
            navigator.mediaDevices.getUserMedia({ audio: true, video: false }).then((stream) => {
                streamRef.current = stream
                recorder = new RecordRTC(stream, {
                    type: "audio",
                    mimeType: "audio/webm",
                    sampleRate: 44100,
                    desiredSampleRate: 16000,
                    timeSlice: 1000,
                    numberOfAudioChannels: 1,
                    recorderType: StereoAudioRecorder,
                    ondataavailable(blob: Blob) {
                        sendBinaryStream(blob)
                    },
                })
                recorder.startRecording()
            })
        }
        return () => {
            recorder?.stopRecording()
            streamRef.current?.getTracks().forEach((track) => track.stop())
        }
    }, [])
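If the server keeps one recognize stream alive, I suspect the client counterpart is to open one socket.io-stream when recording starts and append every blob to it, instead of creating a fresh stream per blob. Another untested sketch:

// Untested sketch: one long-lived socket.io-stream for the whole recording,
// reusing the same socket as in sendBinaryStream above.
const liveStream = ss.createStream()
ss(socket).emit("stream-speech", liveStream, { name: "_temp/stream.wav" })

const sendBinaryStream = (blob: Blob) => {
    // Append this chunk without ending liveStream, so the server sees
    // one continuous stream for the entire recording session.
    ss.createBlobReadStream(blob).pipe(liveStream, { end: false })
}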
Any help is appreciated!
                        
I have exactly the same question!
Maybe Google's official demo is using node-record-lpcm16 with SoX: https://cloud.google.com/speech-to-text/docs/streaming-recognize?hl=en
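For reference, the streaming sample on that page looks roughly like this. It records the mic with SoX on the machine running Node, so it doesn't solve the browser-mic case by itself, but it shows the "one long-lived recognize stream" pattern:

// Roughly the streaming sample from the docs linked above.
// Requires SoX installed locally (it shells out to the "rec" program).
import recorder from "node-record-lpcm16"
import speech from "@google-cloud/speech"

const client = new speech.SpeechClient()

const request = {
    config: {
        encoding: "LINEAR16" as const,
        sampleRateHertz: 16000,
        languageCode: "en-US",
    },
    interimResults: true, // emit partial hypotheses while you speak
}

// ONE recognize stream for the whole session, not one per chunk.
const recognizeStream = client
    .streamingRecognize(request)
    .on("error", console.error)
    .on("data", (data: any) =>
        console.log(data.results?.[0]?.alternatives?.[0]?.transcript)
    )

// SoX captures raw LINEAR16 mic audio and pipes it continuously
// into the recognize stream.
recorder
    .record({ sampleRateHertz: 16000, recordProgram: "rec" })
    .stream()
    .on("error", console.error)
    .pipe(recognizeStream)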