I have the following Python code:
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset
import torch
import torchaudio
import soundfile as sf
import speechbrain as sb
from speechbrain.pretrained import SpeakerRecognition
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
text_input = input("Enter text in English: ")
inputs = processor(text=text_input, return_tensors="pt")
spk_rec = SpeakerRecognition.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb", savedir="pretrained_models/spkrec-ecapa-voxceleb")
embeddings_dataset = load_dataset("vctk", trust_remote_code=True)
wav = sb.dataio.dataio.read_audio(embeddings_dataset['train'][5841]['file'])
speaker_embeddings = spk_rec.encode_batch(wav.unsqueeze(0))
speaker_embeddings = speaker_embeddings.squeeze(0)
num_tokens = inputs["input_ids"].shape[1]
speaker_embeddings = speaker_embeddings.unsqueeze(1).unsqueeze(2).unsqueeze(3).unsqueeze(4).expand(1, 1, 1, 1, num_tokens, 192)
spectrogram = model.generate_speech(inputs["input_ids"], speaker_embeddings)
with torch.no_grad():
    speech = vocoder(spectrogram)
sf.write("output.wav", speech.numpy(), samplerate=16000)
However, it produces warnings and then an error. When I run the code, I first get these warnings:
The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
C:\Users\mceca\AppData\Roaming\Python\Python310\site-packages\speechbrain\utils\torch_audio_backend.py:22: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
torchaudio.set_audio_backend("soundfile")
The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
C:\Users\mceca\AppData\Roaming\Python\Python310\site-packages\speechbrain\utils\torch_audio_backend.py:22: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
torchaudio.set_audio_backend("soundfile")
C:\Users\mceca\Desktop\py.py:9: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
torchaudio.set_audio_backend('soundfile')
Then, after I enter the text, I get this error:
Traceback (most recent call last):
  File "C:\Users\mceca\Desktop\py.py", line 33, in <module>
    with torch.no_grad():
  File "C:\Users\mceca\AppData\Roaming\Python\Python310\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\mceca\AppData\Roaming\Python\Python310\site-packages\transformers\models\speecht5\modeling_speecht5.py", line 2921, in generate_speech
    return _generate_speech(
  File "C:\Users\mceca\AppData\Roaming\Python\Python310\site-packages\transformers\models\speecht5\modeling_speecht5.py", line 2521, in _generate_speech
    decoder_hidden_states = model.speecht5.decoder.prenet(output_sequence, speaker_embeddings)
  File "C:\Users\mceca\AppData\Roaming\Python\Python310\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\mceca\AppData\Roaming\Python\Python310\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\mceca\AppData\Roaming\Python\Python310\site-packages\transformers\models\speecht5\modeling_speecht5.py", line 700, in forward
    speaker_embeddings = speaker_embeddings.expand(-1, inputs_embeds.size(1), -1)
RuntimeError: expand(torch.FloatTensor{[1, 1, 1, 1, 1, 8, 192]}, size=[-1, 1, -1]): the number of sizes provided (3) must be greater or equal to the number of dimensions in the tensor (7)
This is my first time working with torch, torchaudio, and a Hugging Face TTS model. Please modify my code and explain your changes (I need the corrected code itself, not only a description).
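
To make the RuntimeError easier to reason about, here is a minimal sketch of the shape problem. It uses plain torch with a random tensor instead of the real models. The helper `prenet_expand` is hypothetical: its `expand` call is copied from the last frame of the traceback, while the `unsqueeze(1)` before it is an assumption inferred from the error message, which reports a 7-D tensor even though the code above builds a 6-D one.

```python
import torch


def prenet_expand(speaker_embeddings: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Hypothetical stand-in for the step that fails in the traceback.

    The expand call mirrors the traceback; the unsqueeze(1) is an assumption
    inferred from the 7-D shape printed in the error message.
    """
    speaker_embeddings = speaker_embeddings.unsqueeze(1)   # [B, D] -> [B, 1, D]
    return speaker_embeddings.expand(-1, seq_len, -1)      # [B, 1, D] -> [B, seq_len, D]


num_tokens = 8
# Random tensor with the shape the code above has after squeeze(0): [1, 192]
embedding = torch.randn(1, 192)

# What the code above does: add four axes and expand by hand, giving a 6-D tensor.
manual = (
    embedding.unsqueeze(1).unsqueeze(2).unsqueeze(3).unsqueeze(4)
    .expand(1, 1, 1, 1, num_tokens, 192)
)

try:
    prenet_expand(manual, 1)
    failed = False
except RuntimeError as err:
    failed = True
    print("6-D input fails:", err)

# Passing the plain 2-D tensor lets the model-side code do the expansion itself.
expanded = prenet_expand(embedding, num_tokens)
print(tuple(manual.shape), tuple(expanded.shape))
```

If this sketch matches what the library does, the manual `unsqueeze(...).expand(...)` line is unnecessary and `generate_speech` should receive the 2-D `[1, D]` tensor directly. As a separate point to verify: as far as I know, the `microsoft/speecht5_tts` checkpoint was trained with 512-dimensional x-vector embeddings (the Hugging Face example uses `speechbrain/spkrec-xvect-voxceleb`), so a 192-dimensional ECAPA embedding may still fail at a later layer even once the shape is 2-D.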