I have an Excel file that contains an embedded (attached) PDF.
I am trying to use PHPExcel and PhpSpreadsheet to fetch the data. I can fetch the images successfully, but other objects such as PDFs are not accessible.
My first attempt is in PHP, but I am also fine with a Python solution if that is possible.
An XLSX file is a zip container of Excel components, so we can open it as a zip file and manipulate the contents.
The objects of interest are in the "embeddings" folder. If there is only one embedding, it is easy to extract as oleObject1.bin: one line to extract it and one line to open it in an editor, or use your own customised Python script to find and save it.
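As a rough Python sketch of that extraction step: the function name extract_embeddings and the output folder are made up for illustration, and it assumes the embedded objects sit under a path containing "embeddings/" (Excel normally uses xl/embeddings/).

```python
import zipfile
from pathlib import Path


def extract_embeddings(xlsx_path, out_dir):
    """Copy every file found under an 'embeddings/' folder of the XLSX into out_dir.

    Hypothetical helper for illustration; returns the list of saved paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(xlsx_path) as zf:
        for name in zf.namelist():
            # Assumption: embedded objects are stored as e.g. xl/embeddings/oleObject1.bin
            if "embeddings/" in name and not name.endswith("/"):
                target = out_dir / Path(name).name
                target.write_bytes(zf.read(name))
                saved.append(target)
    return saved
```

Calling extract_embeddings("book.xlsx", "embedded") on a workbook with a single attachment should leave embedded/oleObject1.bin on disk, ready for the next step.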
In that BIN file we can seek to the address of the PDF header
%PDF-
here at 00002240
Also seek to its EOF marker @ 00004794
%%EOF\x0A
Now, using any method such as head and tail, splice out that PDF (in this case 2554 bytes) and save it as BINary.pdf
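The same splice can be sketched in Python instead of head and tail. This is only a minimal sketch: carve_pdf is a hypothetical helper, and it assumes the PDF bytes are stored contiguously inside the BIN file, which may not hold for every OLE container.

```python
def carve_pdf(blob):
    """Return the bytes from the first '%PDF-' to the last '%%EOF' in blob, or None.

    Hypothetical helper; assumes the PDF is stored as one contiguous run of bytes.
    """
    start = blob.find(b"%PDF-")
    if start == -1:
        return None
    # Use the last marker: an incrementally updated PDF can contain several %%EOF
    end = blob.rfind(b"%%EOF")
    if end < start:
        return None
    end += len(b"%%EOF")
    # Keep the line break that usually follows the marker (e.g. \x0A)
    if blob[end:end + 2] == b"\r\n":
        end += 2
    elif blob[end:end + 1] in (b"\n", b"\r"):
        end += 1
    return blob[start:end]
```

To use it, read oleObject1.bin in binary mode, pass the bytes to carve_pdf, and write the result to BINary.pdf if it is not None.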
I wrote a script to extract a PDF from an Office BIN file on Windows, so after extracting the archive with tar, Windows users can run this script. NOTE: it has 2 small .exe dependencies that you need to download and set a path for, so read and edit the start of the file. You should be able to emulate that in PHP or Python, so for starters see https://stackoverflow.com/a/56742848/10802527