Does Google Speech-to-Text Consume Internet Data for Silent Audio Input in Python?

I am currently using Python's SpeechRecognition library with Google's Speech-to-Text functionality. My concern is what happens when no speech is detected during audio processing. As I understand it, the audio data is still sent to Google for analysis even when it contains no speech. I am building a voice assistant like Alexa or Google Home, which needs to listen for a wake phrase such as "Hey Google".

My question is: does Google Speech-to-Text consume internet data even when no speech is detected in the audio input? I want to avoid unnecessary internet usage for silent audio. How can I solve this? Can silent audio be identified locally? Any insights or clarifications would be greatly appreciated. Thank you!

There is 1 answer below

Seon On

Assuming your code looks something like this:

import speech_recognition as sr

# obtain audio from the microphone
r = sr.Recognizer()
with sr.Microphone() as source:
    print("Say something!")
    audio = r.listen(source)
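# nothing is uploaded until listen() has returned a recorded phrase
try:
    text = r.recognize_google(audio, key="GOOGLE_SPEECH_RECOGNITION_API_KEY")  # placeholder key
    print("Google Speech Recognition thinks you said " + text)
except sr.UnknownValueError:
    print("Google Speech Recognition could not understand the audio")
except sr.RequestError as e:
    print("Could not request results from Google Speech Recognition; {0}".format(e))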

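In that case, silent segments will not be sent to the Google API, because Recognizer.listen waits for the audio energy to exceed a threshold before it starts recording, and that check happens locally. The threshold is the energy_threshold attribute of the Recognizer instance, which you can set after creating it (by default the library also adjusts it dynamically). Note that background noise louder than the threshold will still be recorded and uploaded. From the documentation: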

def listen(self, source, timeout=None, phrase_time_limit=None, snowboy_configuration=None)

Records a single phrase from source (an AudioSource instance) into an AudioData instance, which it returns. This is done by waiting until the audio has an energy above recognizer_instance.energy_threshold (the user has started speaking), and then recording until it encounters recognizer_instance.pause_threshold seconds of non-speaking or there is no more audio input. The ending silence is not included.
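To make the energy check above concrete, here is a minimal sketch of how a chunk of audio can be classified as silent or not on the local machine, using only the standard library. It is an illustration, not the library's actual code: the helper names rms_energy, is_loud_enough and sine_chunk are hypothetical, the audio is assumed to be 16-bit little-endian mono PCM, and 300.0 is used as an example threshold.

```python
import math
import struct


def rms_energy(pcm_bytes: bytes) -> float:
    """Return the RMS amplitude of little-endian 16-bit mono PCM audio."""
    count = len(pcm_bytes) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, pcm_bytes[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def is_loud_enough(pcm_bytes: bytes, energy_threshold: float = 300.0) -> bool:
    """True if the chunk's energy exceeds the (assumed) threshold."""
    return rms_energy(pcm_bytes) > energy_threshold


def sine_chunk(amplitude: int, n_samples: int = 1024,
               freq: float = 440.0, rate: int = 16000) -> bytes:
    """Build a synthetic test tone so the example needs no microphone."""
    values = [int(amplitude * math.sin(2 * math.pi * freq * i / rate))
              for i in range(n_samples)]
    return struct.pack("<%dh" % n_samples, *values)


silence = bytes(2048)        # 1024 samples of digital silence
tone = sine_chunk(5000)      # a clearly audible synthetic tone

print("silence triggers recording:", is_loud_enough(silence))
print("tone triggers recording:", is_loud_enough(tone))
```

Only chunks that pass a check of this kind start a recording, so a quiet room produces no upload at all.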

The timeout parameter is the maximum number of seconds that this will wait for a phrase to start before giving up and throwing a speech_recognition.WaitTimeoutError exception. If timeout is None, there will be no wait timeout.

The phrase_time_limit parameter is the maximum number of seconds that this will allow a phrase to continue before stopping and returning the part of the phrase processed before the time limit was reached. The resulting audio will be the phrase cut off at the time limit. If phrase_time_limit is None, there will be no phrase time limit.
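The behaviour quoted above can be modelled with a small simulation over per-chunk energy values. This is a simplified sketch under stated assumptions, not the library's implementation: simulate_listen and PhraseWaitTimeout are hypothetical names (the latter stands in for speech_recognition.WaitTimeoutError), each list entry represents the energy of one fixed-length chunk, and the real library's dynamic threshold adjustment and padding are left out.

```python
class PhraseWaitTimeout(Exception):
    """Hypothetical stand-in for speech_recognition.WaitTimeoutError."""


def simulate_listen(energies, chunk_seconds, energy_threshold=300.0,
                    pause_threshold=0.8, timeout=None, phrase_time_limit=None):
    """Return the chunk energies that would be recorded as one phrase."""
    # Phase 1: wait locally until a chunk exceeds the energy threshold.
    start = None
    for i, energy in enumerate(energies):
        if energy > energy_threshold:
            start = i
            break
        waited = (i + 1) * chunk_seconds
        if timeout is not None and waited > timeout:
            raise PhraseWaitTimeout("no phrase started within the timeout")
    if start is None:
        return []  # input ended before any speech: nothing to upload

    # Phase 2: record until pause_threshold seconds of quiet or the time limit.
    phrase = []
    silent_chunks = 0
    for energy in energies[start:]:
        if (phrase_time_limit is not None
                and len(phrase) * chunk_seconds >= phrase_time_limit):
            break
        phrase.append(energy)
        silent_chunks = 0 if energy > energy_threshold else silent_chunks + 1
        if silent_chunks * chunk_seconds >= pause_threshold:
            break

    # The ending silence is not included in the result.
    if silent_chunks:
        phrase = phrase[:-silent_chunks]
    return phrase


stream = [10, 20, 500, 600, 15, 700, 5, 5, 5, 900]
print(simulate_listen(stream, chunk_seconds=0.5, pause_threshold=1.0))
```

In this model the leading quiet chunks and the trailing pause never make it into the returned phrase, which is the part that would later be handed to recognize_google.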