I am able to convert a wav file to a spectrogram and then back again with an acceptable level of quality. I can plot and save that spectrogram as a jpg file, but I have not been able to import the jpg and convert it back to audio.
I can convert the audio to a db scaled spectrogram
import librosa
x, sr = librosa.load(librosa.ex('trumpet'))
X = librosa.stft(x)
Xdb = librosa.amplitude_to_db(abs(X))
And I am able to convert the db scaled spectrogram back to audio
X2 = librosa.db_to_amplitude(Xdb)
audio = librosa.griffinlim(X2)
import soundfile as sf
sf.write("test1.wav", audio, sr)
I can save the array as a 32bit Tiff, and recreate the audio from that tiff file.
from PIL import Image
import numpy as np
im = Image.fromarray(Xdb).convert('F')
im.save("test.tiff")
img = Image.open("test.tiff")
recspec = np.array(img)
X2 = librosa.db_to_amplitude(recspec)
audio = librosa.griffinlim(X2)
import soundfile as sf
sf.write("test1.wav", audio, sr)
I can plot the db scaled spectrogram and save it as a jpg
from matplotlib import pyplot as plt
import librosa.display
fig = plt.figure(figsize=(10, 10), dpi=1000, frameon=False)
ax = fig.add_axes([0, 0, 1, 1], frameon=False)
ax.axis('off')
librosa.display.specshow(Xdb, sr=sr, cmap='gray', x_axis='time', y_axis='hz')
plt.savefig("test.jpg", bbox_inches='tight', pad_inches=0)
But I have been completely unable to figure out how to reimport the jpg in such a way as to recreate the audio from it. I realise it is not as simple as just importing the jpg the same way as the tiff, and that saving in a lossy format like jpg will cause some significant loss of quality, but I would be OK with that if the resulting audio at least slightly resembled what went in. I have looked at code that does similar things, but those approaches have been much more complicated, such as using the colour channels to encode phase. I have been happy with the quality of the Griffin-Lim reconstruction, so I am happy to skip that. If someone could point me in the right direction that would be great.
As you have alluded to, recovering the waveform from a magnitude spectrogram with Griffin-Lim has some limits on fidelity. But if you are happy with the results in that case, then the issue is specific to the JPEG encoding (or decoding).
First, your way of saving the JPEG is wrong: you should not plot the values, but instead save the spectrogram array itself using PIL, the same way you do for the TIFF.
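A minimal sketch of saving the array directly with PIL (the `Xdb` here is a smooth stand-in array; in your code it would be the dB spectrogram from `librosa.amplitude_to_db(abs(X))`):

```python
import numpy as np
from PIL import Image

# Stand-in for the dB-scaled spectrogram (values roughly -80..0 dB).
Xdb = (-80.0 + 80.0 * np.outer(np.hanning(513), np.hanning(400))).astype(np.float32)

# Map the dB values linearly onto 0-255 and save as a single-channel JPEG.
lo, hi = Xdb.min(), Xdb.max()
scaled = ((Xdb - lo) / (hi - lo) * 255.0).astype(np.uint8)
Image.fromarray(scaled, mode='L').save("spec.jpg")
```

Note there is no matplotlib involved at all: the pixel grid is exactly the spectrogram array, one pixel per time/frequency bin.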
There are three key challenges when encoding a magnitude spectrogram into a JPEG:
1) mapping the floating-point spectrogram values into the 0-255 range
2) the lossy JPEG compression
3) the colour channels
Regarding 2): turn off all compression to start with. You can try to re-introduce it later, but get the simple case working first.
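With Pillow you cannot fully disable JPEG's lossy stage, but `quality=100` and `subsampling=0` get close to it. A sketch (the gradient array is a stand-in for the already-scaled spectrogram):

```python
import numpy as np
from PIL import Image

# Smooth stand-in for a spectrogram already scaled to 0-255.
arr = np.tile(np.arange(64, dtype=np.uint8) * 4, (64, 1))
im = Image.fromarray(arr, mode='L')

# quality=100 plus subsampling=0 minimise (but cannot fully eliminate)
# the loss introduced by JPEG compression.
im.save("nocomp.jpg", quality=100, subsampling=0)

# Reloading shows the values survive nearly unchanged for smooth data.
back = np.array(Image.open("nocomp.jpg"))
```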
Regarding 1): you must make sure your spectrogram values fit into the range 0-255. A good starting point is to decibel-scale the spectrogram (using for example librosa.power_to_db()), and then use a linear mapping between the resulting values and 0-255. The key to decoding the spectrogram later is knowing these scaling values, so you can reverse the process. This can be done with fixed/hard-coded scaling values, but it might be tricky to find values that work for all audio/spectrogram inputs. Alternatively, you can store the scaling factors as metadata in the JPEG, using a custom EXIF tag.
Regarding 3): make sure that you are not saving colour JPEGs. Instead, use a single greyscale/luminosity channel.
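The full round trip can be sketched with plain NumPy and Pillow. The smooth `Xdb` array stands in for the real dB spectrogram, and the inverse-dB formula stands in for `librosa.db_to_amplitude`; with librosa installed you would feed the result to `librosa.griffinlim` as in the question:

```python
import numpy as np
from PIL import Image

# Stand-in for Xdb; real code would use librosa.amplitude_to_db(abs(X)).
Xdb = (-80.0 + 80.0 * np.outer(np.hanning(257), np.hanning(200))).astype(np.float32)

# --- Encode: remember the scaling factors so decoding can reverse them ---
lo, hi = float(Xdb.min()), float(Xdb.max())
img = ((Xdb - lo) / (hi - lo) * 255.0).astype(np.uint8)
Image.fromarray(img, mode='L').save("roundtrip.jpg", quality=100, subsampling=0)

# --- Decode: load the JPEG and undo the 0-255 mapping ---
recovered = np.array(Image.open("roundtrip.jpg"), dtype=np.float32)
rec_db = recovered / 255.0 * (hi - lo) + lo

# db_to_amplitude is just the inverse dB formula; with librosa you would do
# X2 = librosa.db_to_amplitude(rec_db); audio = librosa.griffinlim(X2)
X2 = 10.0 ** (rec_db / 20.0)
```

Here `lo` and `hi` are kept in Python variables; in a real pipeline they would be hard-coded or stored as JPEG metadata as described above.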
PNG would be a bit less limiting if you do not get acceptable quality from JPEG, as it supports 16-bit values and uses lossless compression.
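A sketch of the 16-bit PNG variant, again with a stand-in array for the dB spectrogram; since PNG is lossless, the quantised values come back exactly:

```python
import numpy as np
from PIL import Image

# Smooth stand-in for the dB-scaled spectrogram.
Xdb = (-80.0 + 80.0 * np.outer(np.hanning(257), np.hanning(200))).astype(np.float32)

lo, hi = float(Xdb.min()), float(Xdb.max())
# 16 bits give 65536 quantisation levels instead of JPEG's 256.
img16 = ((Xdb - lo) / (hi - lo) * 65535.0).astype(np.uint16)
Image.fromarray(img16).save("spec.png")  # stored as 16-bit greyscale PNG

# Lossless: reloading returns the stored values unchanged.
back = np.asarray(Image.open("spec.png"))
```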
I have stored magnitude spectrograms successfully in JPEG files. However, we never converted them back to audio; instead, they were used as spectrogram inputs for machine learning and for computing acoustical parameters such as short-time sound levels.