I found out that some of my '.wav' files are badly written. This is what I get when I try to read 'corrupted_file.wav' using standard libraries such as soundfile, wave or librosa:
RuntimeError: Error opening 'corrupted_file.wav': File contains data in an unknown format.
So I tried to understand what the issue was by comparing its raw bytes with those of a working file, 'ok_file.wav':
with open('corrupted_file.wav', 'rb') as audiofile:
    corrupted_f = audiofile.read()
with open('ok_file.wav', 'rb') as audiofile:
    ok_f = audiofile.read()
print(corrupted_f[:40])
print(ok_f[:40])
That's what I get:
b'\xff\xfb\x90\xc4\x00\x03\x12\xa9\xa3\x16g\xb0\xc9B\xf4\xb4e\xcd\x94\x9a8\x00\x12\x93\x95\xc5F~\x1e\xa71\xd2q\x18\xa58\xeb\x01\x82\x19'
b'RIFF$`\x08\x00WAVEfmt \x10\x00\x00\x00\x01\x00\x01\x00\x80\xbb\x00\x00\x00\xee\x02\x00\x04\x00\x10\x00data'
As you can see, 'corrupted_file.wav' does not follow the WAVE format: it lacks the expected 'RIFF', 'WAVE', 'fmt ' and 'data' markers. Nevertheless, Windows 10 is able to play it with its built-in player. If I use a standard audio converter to export 'corrupted_file.wav' as a WAVE file, I get 'converted_corrupted_file.wav', whose first bytes are:
b'RIFFF\x16\x11\x00WAVEfmt \x10\x00\x00\x00\x01\x00\x02\x00D\xac\x00\x00\x10\xb1\x02\x00\x04\x00\x10\x00LIST\x1a\x00\x00\x00INFOISFT\x0e\x00\x00\x00Lavf59.27.100\x00data\x00\x16\x11\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\...
To me, this is not easy to relate to the bytes of the corrupted file, so I cannot recover it with a homemade function.
I checked the file's format using several online tools and the result is always the same: "does not match any of the known formats". But Windows 10 is still able to decode and play it, so I guess there must be a trick.
I could semi-manually move every corrupted file into a temporary folder, import them into an audio converter that converts from '.wav' to '.wav', and then move the new versions back to their original locations while deleting the corrupted originals.
But how can I automatically recover 'corrupted_file.wav' with a function written in Python? I need this because there are thousands of corrupted files.
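It looks like an MP3 file: it starts with b'\xff\xfb', which is the frame sync of an MPEG-1 Layer III (MP3) stream rather than the 'RIFF' header of a WAVE file. Rename it to file.mp3, use a tool to convert it to file.wav, and open file.wav from Python.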
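To do this for thousands of files, something along these lines might work. It is only a sketch: it assumes the ffmpeg command-line tool is installed and on your PATH (the 'Lavf' tag in your converted file suggests your converter is ffmpeg-based), and the names sniff_audio_format, convert_to_wav and repair_tree are made up for this example. It overwrites the mislabeled files in place, so try it on a copy of your data first.

```python
import subprocess
from pathlib import Path


def sniff_audio_format(header: bytes) -> str:
    """Guess the format from the first bytes of a file: 'wav', 'mp3' or 'unknown'."""
    # A real WAVE file starts with 'RIFF', a 4-byte size, then 'WAVE'.
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    # MP3 files often begin with an ID3v2 tag before the audio frames.
    if header[:3] == b"ID3":
        return "mp3"
    # MPEG audio frame header: 11 sync bits set, layer field == 01 (Layer III).
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE6) == 0xE2:
        return "mp3"
    return "unknown"


def convert_to_wav(src: Path, dst: Path) -> None:
    """Decode an MP3 stream (whatever its extension) to a WAVE file with ffmpeg."""
    # '-f mp3' forces the input format, so the misleading '.wav' extension
    # of the source does not matter. Assumes ffmpeg is on PATH.
    subprocess.run(
        ["ffmpeg", "-y", "-loglevel", "error", "-f", "mp3", "-i", str(src), str(dst)],
        check=True,
    )


def repair_tree(root: Path) -> list:
    """Replace every '.wav' under root that is really an MP3 with a true WAVE file."""
    repaired = []
    # sorted() takes a snapshot, so temporary files created below are not revisited.
    for path in sorted(root.rglob("*.wav")):
        with path.open("rb") as f:
            header = f.read(12)
        if sniff_audio_format(header) != "mp3":
            continue  # already a valid WAVE file, or something else entirely
        tmp = path.with_name(path.stem + ".fixed.wav")
        convert_to_wav(path, tmp)
        tmp.replace(path)  # overwrite the mislabeled original
        repaired.append(path)
    return repaired
```

You would then call something like repair_tree(Path('my_dataset')) once and read the files with soundfile as usual. Note that the result will not be byte-identical to whatever the original recording was: the data was already MP3-compressed, and converting only decodes it to PCM.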