I am creating a parser that extracts the Exif metadata from an image. I followed this documentation to parse the Exif data from a JPEG file, and it worked:
// Here are the first 16 bytes of the Exif data, as described in the documentation
45 78 69 66 (Exif)
00 00 (padding)
49 49 (byte align)
2A 00 (byte align)
08 00 00 00 (offset)
0C 00 (number of attributes)
Then I started to extract the Exif data from the .webp file format, but I didn't have any WebP image with Exif data present, so I went to this Exif editor and inserted the Exif data. However, it inserted the data in a different format:
45 58 49 46 (EXIF) // note that this header is different from the previous one
1E 02 (chunk size big-endian)
00 00 (padding maybe or could be part of chunk size)
4D 4D (byte align)
00 2A (byte align)
00 00 00 10 (offset)
45 78 69 66 4D 65 74 61 (ExifMeta) // I don't know why this is there
I thought that maybe the website was inserting the data wrongly, so I viewed the Exif data present in the image using an online Exif viewer, and it was there. So I don't understand why there are two different structures for storing Exif data, and where I can find the documentation on how to parse the second type of Exif data.
You cannot blindly scan a file for 45 78 69 66 or 45 78 69 66 00 00 and then expect a full Exif metadata structure, since other things in each file might contain those byte sequences by coincidence, most likely through user texts/comments. You have to treat each file as per its format.
JFIF = JPEG file interchange format
It has its own format, not used by anyone else.
With that knowledge you parse a JFIF file, iterating through all the segments. One (or multiple or zero) of the segments is called APP1, identified by FF E1 followed by two bytes for the size, followed by the actual payload bytes. Each APP1 segment is identified by its first bytes until the first 00 byte - in this case it's Exif\0\0 (45 78 69 66 00 00), where the first 00 byte is the identification termination and only the second 00 byte is for padding reasons. Other possible identifications are:
- http://ns.adobe.com/xap/1.0/\0 for XMP
- http://ns.adobe.com/xmp/extension/\0 for Extended XMP (continuing a previous APP1 segment, since segments cannot exceed 65533 bytes of payload)
- QVCI\0 for some Casio products
- FLIR\0
- G3FAX (yes, without terminating null byte)
- PARROT\0
And after that identification the overall Exif metadata payload starts.
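The segment iteration described above could be sketched roughly as follows. This is a minimal illustration and not a complete JPEG parser: the function name find_jpeg_exif is made up for this example, and it assumes a well-formed file in which every segment before the image data carries a 2-byte big-endian size (which counts the size field itself).

```python
import struct

def find_jpeg_exif(data: bytes):
    """Return the TIFF/Exif payload of the first Exif APP1 segment, or None.

    Hypothetical helper for illustration; assumes a well-formed file.
    """
    if data[:2] != b"\xff\xd8":              # SOI marker must come first
        raise ValueError("not a JPEG file")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:                   # fill byte before the real marker
            pos += 1
            continue
        if marker in (0xDA, 0xD9):           # SOS or EOI: no more metadata segments
            break
        # The size is big-endian and includes its own two bytes.
        (size,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + size]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return payload[6:]               # skip the "Exif\0\0" identification
        pos += 2 + size
    return None
```

The returned bytes start with the TIFF header (49 49 or 4D 4D), so they can be handed to a separate TIFF/Exif parser.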
WebP = Web Picture
It uses RIFF, which has been known for decades from other file formats like WAV or AVI. It is Microsoft's adaptation of IFF, and together with Apple's QTFF all 3 are quite similar.
With that knowledge you parse a RIFF file, iterating through all the chunks. One (or multiple or zero) of the chunks has the identification EXIF (45 58 49 46), and its payload is then the Exif metadata.
TIFF = Tagged Image File Format
This is not only a file format on its own, but is also used in its entirety for Exif.
See the official TIFF 6.0 specification for how to parse this format (I cannot find where Adobe currently hosts it). Exif also has official documentation. Parsing this format is a bit more challenging than parsing JFIF or RIFF.
Conclusion
The bottom line is: don't confuse multiple formats - parse them separately. If you can parse JFIF and RIFF, you should be able to extract the identical payload bytes of the Exif metadata from both. Parsing Exif should be done separately again, just like one would parse a TIFF file.
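For the RIFF side, a chunk walk over a WebP file might look like the sketch below. The function name find_webp_exif is hypothetical. The sketch relies on two properties of RIFF: each chunk has a 4-byte identification followed by a 4-byte little-endian size, and odd-sized payloads are followed by one padding byte. Read that way, the bytes 1E 02 00 00 from the question are a single size value of 542, not a 2-byte size plus padding.

```python
import struct

def find_webp_exif(data: bytes):
    """Return the payload of the first EXIF chunk in a WebP file, or None.

    Hypothetical helper for illustration; assumes a well-formed RIFF file.
    """
    if data[:4] != b"RIFF" or data[8:12] != b"WEBP":
        raise ValueError("not a WebP file")
    pos = 12                                  # first chunk after "RIFF" + size + "WEBP"
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        # Chunk sizes in RIFF are 4 bytes, little-endian, excluding the 8-byte header.
        (size,) = struct.unpack("<I", data[pos + 4:pos + 8])
        if chunk_id == b"EXIF":
            return data[pos + 8:pos + 8 + size]
        pos += 8 + size + (size & 1)          # odd-sized payloads are padded to even
    return None
```

As with the JFIF case, the result should begin with the TIFF header and can be passed on to the same TIFF/Exif parsing code.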
Exif can also reside in other files:
- eXIf chunk, or the chunk identifications zXIf or zxIf for compressed Exif data. It could also be one of the iTXt, zTXt or tEXt text chunks with a keyword of exif or APP1.
- 0x0423 have Exif payload.
- exif, Exif and exfc indicate Exif payload. So does a UUID atom/box with the value 0537cdab-9d0c-4431-a72a-fa561f2a113e.
As you see: scanning for Exif alone would not find every occurrence and may also produce false positives. It's by far more robust to just adhere to each file's format. Those formats all have their advantages and disadvantages, and also different and even multiple ways in which Exif metadata can be stored in them. One way to better understand all this is to generate Exif metadata in each file format with a long and unique UserComment text that you can easily spot in any of the files.
You don't need to know why ExifMeta is there: the offset tells you where to look next. In your previous file it was 08 00 00 00 (different endianness), indicating to look at offset 8, which is just the next byte (since the offset is inclusive, counting the header of 8 bytes already). In this file it is 16, which means the next 8 bytes are undefined and can be there for various reasons (they can also be used for stuffing in some kind of advertisement). Just skip them and off you go to the actual TIFF/Exif content.
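To make the offset handling concrete, here is a small sketch that reads the 8-byte TIFF header and then jumps to wherever the offset points. The function name parse_tiff_header is made up for this example. In the second usage line, the two bytes after ExifMeta are invented, since the question does not show what follows them.

```python
import struct

def parse_tiff_header(tiff: bytes):
    """Return (byte_order, ifd_offset, entry_count) for a TIFF/Exif payload.

    Hypothetical helper for illustration only.
    """
    # "II" means little-endian ("<"), "MM" means big-endian (">").
    order = {b"II": "<", b"MM": ">"}.get(tiff[:2])
    if order is None:
        raise ValueError("unknown byte order mark")
    # 2-byte magic number (always 42) followed by the 4-byte offset of the first IFD.
    magic, ifd_offset = struct.unpack(order + "HI", tiff[2:8])
    if magic != 42:
        raise ValueError("bad TIFF magic number")
    # The offset counts from the start of the header; the IFD begins with its entry count.
    (count,) = struct.unpack(order + "H", tiff[ifd_offset:ifd_offset + 2])
    return order, ifd_offset, count

# Header bytes from the JPEG in the question: the IFD follows immediately.
print(parse_tiff_header(bytes.fromhex("49 49 2A 00 08 00 00 00 0C 00")))
# prints ('<', 8, 12)

# Header bytes from the WebP in the question; the trailing entry count is made up.
print(parse_tiff_header(bytes.fromhex("4D 4D 00 2A 00 00 00 10") + b"ExifMeta" + b"\x00\x0c"))
# prints ('>', 16, 12)
```

Both headers are handled by the same code: the 8 bytes of ExifMeta are never looked at, because the offset of 16 skips over them.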