finding the IFDs in exif data


I want to evaluate some Exif data in awk, and I have this documentation:
https://web.archive.org/web/20131111073619/http://www.exif.org/Exif2-1.PDF
but I am stuck evaluating an example Exif block.

The documentation says:
IFD structure
Bytes 0-1 Tag
Bytes 2-3 Type
Bytes 4-7 Count
Bytes 8-11 Value Offset

Type
The following types are used in Exif:
1 = BYTE An 8-bit unsigned integer
2 = ASCII An 8-bit byte containing one 7-bit ASCII code. The final byte is terminated with NULL
3 = SHORT A 16-bit (2-byte) unsigned integer
4 = LONG A 32-bit (4-byte) unsigned integer
5 = RATIONAL Two LONGs. The first LONG is the numerator and the second LONG expresses the denominator
7 = UNDEFINED An 8-bit byte that can take any value depending on the field definition
9 = SLONG A 32-bit (4-byte) signed integer (2's complement notation)
10 = SRATIONAL Two SLONGs. The first SLONG is the numerator and the second SLONG is the denominator

My example Exif data is 14062 bytes long and starts with:

45 78 69 66 00 00 49 49 2a 00 08 00 00 00 0b 00 0e 01 02 00 20 00 00 00 92 00 00 00 0f 01 02 00 05 00 00 00 b2 00 00 00 10 01 02 00 07 00 00 00 b8 00 00 00 12 01 03

45 78 69 66 00 00 -> Exif\x00\x00 exif header

then the TIFF header
49 49 -> II means little endian
2a 00 -> 002a -> 42 TIFF file marker
08 00 00 00 -> 8 -> offset of the first IFD, counted from the start of the TIFF header

so the first IFD is 12 bytes long and starts at offset 8:

0b 00 0e 01 02 00 20 00 00 00 92 00

If I evaluate this as an IFD structure I get:

0b 00 tag -> 00 0b
0e 01 field type -> 01 0e -> decimal 270 - but there are only field types from 1 to 10 ??
02 00 20 00 count -> 00 20 00 02 -> decimal 2097154 - my whole Exif data is only 14062 bytes lol
00 00 92 00 offset of next IFD -> 00 92 00 00 -> decimal 9568256 - bigger than the Exif data... wtf

Please help me - something is wrong here. Where does the first IFD start?


There are 1 best solutions below

Schmaehgrunza On

I finished this and am writing my own answer.
1) How to get the Exif data from a JPEG
The whole JPEG is structured with markers. A marker starts with the byte 0xFF, followed by a second byte which identifies the marker, for example:
0xFFD8 ... Start of Image, 0xFFDA ... Start of Scan (a short segment followed by the compressed data), ...
or the APP markers from 0xFFE0 to 0xFFEF.
After the marker comes a segment of bytes, which starts with two bytes of segment size (high byte first).
segment_size = Byte 1 * 256 + Byte 2, including the size bytes themselves.
Most markers are followed by a segment, but not all!
For example: 0xFFD8 ... Start Of Image is not followed by a segment.
But that is the only such marker in the JPEG header (from marker 0xFFD8 to marker 0xFFDA).
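
As a small worked example of the size formula, here is a sketch in portable awk wrapped in a shell function. The function name segment_size and the size bytes 36 F0 are my own assumptions: 36 F0 is simply the pair that would match the 14062-byte Exif block from the question (14062 bytes of payload plus the 2 size bytes), not bytes taken from the actual file.

```shell
# segment_size HI LO: prints "<segment size> <payload size>" for two hex size bytes
# (hypothetical helper, not part of the scripts in this answer)
segment_size() {
  awk -v hi="$1" -v lo="$2" '
    function hex2dec(h,    i, n) {            # portable hex string -> number
      h = tolower(h); n = 0
      for (i = 1; i <= length(h); i++)
        n = n * 16 + index("0123456789abcdef", substr(h, i, 1)) - 1
      return n
    }
    BEGIN {
      size = hex2dec(hi) * 256 + hex2dec(lo)  # Byte 1 * 256 + Byte 2
      printf "%d %d\n", size, size - 2        # size - 2: payload without the two size bytes
    }'
}

segment_size 36 f0   # prints: 14064 14062
```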

The Exif data is stored in APP marker E1 -> 0xFFE1, which normally directly follows the marker 0xFFD8 (Start Of Image).

gawk example that reads the Exif data from an image. The input is split into records at every \xFF byte, so that the complete image is not loaded at once:

awk -b '
BEGIN {
FS="^$"; RS="\xFF";                              #-- setting Field and Record Separator
for (i=0;i<256;i++) dec[sprintf("%c",i)]=i;      #-- building byte to decimal
}
#-- converts bytes (string of bytes eg. "\x05\x06\x0A") to a HEX string, each byte separated by delimiter

function bytes_TO_hex (bytes, delimiter,     i, l, hexSTR)
   {
   l=length(bytes); hexSTR=""
   for (i=1;i<=l;i++) hexSTR= hexSTR sprintf("%02X", dec[substr(bytes, i, 1)]) delimiter;
   return substr(hexSTR, 1, length(hexSTR)-length(delimiter));
   }

#-- gets the next bytes by loading further records until at least count bytes are collected; the \xFF record separator that awk stripped is put back in front of every loaded record
function get_nextBytes_LR (count)
   {
   nextBytes="";
   while (length(nextBytes)<count && (getline recordBytes)>0) nextBytes= nextBytes RS recordBytes;   #-- RS instead of RT, because RT is empty for the last record of the file
   return nextBytes;
   }

{            
if (length($0)==0) next;   #-- because JPEG file starts with \xFF, the first record of awk (record splitting "\xFF") is an empty string

jpeg_marker=substr ($0,1,1);

switch (jpeg_marker)
   {
   case "\xD8": break;        #-- \xD8 Start Of Image - no segment following
   case "\xDA": exit;         #-- \xDA Start Of Scan  - short info segment and compressed image data following
   
   #-- segment following - getting the segment size and get bytes til end of segment
   default:
      if (length($0)<3) $0=$0 get_nextBytes_LR(2);                    #-- can happen when the segment size bytes following the jpeg_marker contain FF values
      segment_byte1=substr($0,2,1); segment_byte2=substr($0,3,1);     #-- get the two segment size bytes
      segment_size=dec[segment_byte1]*256 +dec[segment_byte2];        #-- segment size (high byte first), includes the two size bytes
         
      segment_missing_bytecount=segment_size-(length($0)-1);
      segment_bytes_postloaded=get_nextBytes_LR(segment_missing_bytecount);    #-- load missing bytes for segment
      
      if (length(segment_bytes_postloaded) < segment_missing_bytecount)
         {
         print "JPEG error: image corrupt - end of image reached, marker segment size exceeds image size!" >"/dev/stderr";
         exit;
         }
      if (length(segment_bytes_postloaded)>segment_missing_bytecount)
         {               
         print "JPEG error: image corrupt - end of segment is not followed by the next marker FF..!" >"/dev/stderr";
         exit;
         }
      
      #-- if parser is here -> segment was read without faults

      if (jpeg_marker=="\xE1")
         {
         segment=substr($0 segment_bytes_postloaded, 4,segment_size-2); #--build segment
         print bytes_TO_hex(segment);
         exit;
         }        
   }
}' "$1" #-- $1 your JPEG image (quoted, so that file names with spaces work)

The Exif header consists of the 6 bytes 45 78 69 66 00 00 -> Exif\x00\x00 and is followed by the TIFF header.
The Exif properties are structured in IFDs (Image File Directories), a concept taken from TIFF files, which is why a TIFF header follows.

2) Tiff header and IFDs
The TIFF header is 8 bytes long.
Bytes 0-1: byte order, little endian “II” (0x4949) or big endian “MM” (0x4D4D)
----- the byte order of all following values depends on bytes 0-1
Bytes 2-3: 42 (decimal) ... a number which identifies TIFF
Bytes 4-7: offset of IFD 0, counted from the start of the TIFF header (not the Exif header!), which is usually 8 (decimal).
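
To make the header layout concrete, here is a minimal sketch that decodes the 8 header bytes from a hex dump like the one in the question. It is portable awk in a shell function; the names tiff_header, hex2dec and u are my own and do not appear in the scripts of this answer, and the input is assumed to be space-separated hex bytes starting at the TIFF header (that is, after Exif\x00\x00).

```shell
# tiff_header "HEX BYTES": decodes byte order, magic number and IFD 0 offset
# (hypothetical helper working on a hex dump, not on the binary file)
tiff_header() {
  printf '%s\n' "$1" | awk '
    function hex2dec(h,    i, n) {
      h = tolower(h); n = 0
      for (i = 1; i <= length(h); i++)
        n = n * 16 + index("0123456789abcdef", substr(h, i, 1)) - 1
      return n
    }
    # unsigned integer of len bytes at TIFF offset off (awk field off+1 holds offset off)
    function u(off, len, big,    i, v) {
      v = 0
      for (i = 0; i < len; i++)
        v = v * 256 + hex2dec(big ? $(off + 1 + i) : $(off + len - i))
      return v
    }
    {
      order = tolower($1 $2)
      if (order == "4949") big = 0            # II = little endian
      else if (order == "4d4d") big = 1       # MM = big endian
      else { print "not a TIFF header"; exit 1 }
      printf "%s magic=%d ifd0_offset=%d\n", (big ? "MM" : "II"), u(2, 2, big), u(4, 4, big)
    }'
}

tiff_header "49 49 2a 00 08 00 00 00"   # prints: II magic=42 ifd0_offset=8
```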

The first IFD you can read is IFD 0; with an offset of 8 it follows directly after the TIFF header!
An Image File Directory describes fields of bytes (where they are and how to read them), which contain the values of the properties. These fields of bytes are in the data area directly following the IFD. Some of the property values are written directly into the IFD, if they are small enough to fit into the 4 valueoffset bytes of a field descriptor.

IFD structure
The first two bytes of the IFD contain the number of fields (N).
Bytes 0-1: number of fields
---- followed by an array of 12-byte field descriptors, one for every field
Bytes 02-13: first field descriptor
Bytes 14-25: second field descriptor
----- and so on
----- finishing with a 4-byte offset value for the next IFD
Bytes (2+12*N) to (2+12*N+3): offset value of the next IFD; in IFD 0 this points to IFD 1
----- if no IFD follows, the offset value 0 (dec) is entered.
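
Applied to the bytes from the question, this layout can be checked with a short sketch. It is portable awk in a shell function, assumes little endian data given as space-separated hex starting at the TIFF header, and only prints the field descriptors that are completely contained in the dump. The names ifd_layout, le and question_bytes are my own.

```shell
# the TIFF part of the dump from the question (everything after Exif\x00\x00)
question_bytes="49 49 2a 00 08 00 00 00 0b 00 0e 01 02 00 20 00 00 00 92 00 00 00 0f 01 02 00 05 00 00 00 b2 00 00 00 10 01 02 00 07 00 00 00 b8 00 00 00 12 01 03"

# ifd_layout "HEX BYTES": prints the layout of IFD 0 (little endian input assumed)
ifd_layout() {
  printf '%s\n' "$1" | awk '
    function hex2dec(h,    i, n) {
      h = tolower(h); n = 0
      for (i = 1; i <= length(h); i++)
        n = n * 16 + index("0123456789abcdef", substr(h, i, 1)) - 1
      return n
    }
    # little endian unsigned integer of len bytes at TIFF offset off
    function le(off, len,    i, v) {
      v = 0
      for (i = len; i >= 1; i--) v = v * 256 + hex2dec($(off + i))
      return v
    }
    {
      ifd = le(4, 4)                       # offset of IFD 0 from the TIFF header
      n = le(ifd, 2)                       # number of 12-byte field descriptors
      next_ptr = ifd + 2 + 12 * n          # position of the 4-byte next-IFD offset
      printf "entries=%d next_ifd_pointer_at=%d ifd_ends_at=%d\n", n, next_ptr, next_ptr + 3
      for (k = 0; k < n; k++) {
        e = ifd + 2 + 12 * k
        if (e + 12 > NF) break             # descriptor not completely in the dump
        printf "tag=0x%04X type=%d count=%d valueoffset=%d\n", le(e, 2), le(e + 2, 2), le(e + 4, 4), le(e + 8, 4)
      }
    }'
}

ifd_layout "$question_bytes"
# entries=11 next_ifd_pointer_at=142 ifd_ends_at=145
# tag=0x010E type=2 count=32 valueoffset=146
# tag=0x010F type=2 count=5 valueoffset=178
# tag=0x0110 type=2 count=7 valueoffset=184
```

So the 0b 00 at offset 8 is not a tag but the field count (11), and the first field descriptor starts two bytes later. The first value offset (146) is exactly the byte after the end of IFD 0 (145).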

IFD field descriptor
Bytes 00-01: Exif property tag
Bytes 02-03: Type (ASCII 8-bit, SHORT 16-bit, LONG 32-bit, ...)
Bytes 04-07: Count - this is not a byte count, it is the number of values of the given type (how many SHORTs, LONGs, ASCII characters, ...)
Bytes 08-11: valueoffset - value or offset
--- If the value is short enough to fit into the 4 bytes (e.g. four ASCII chars), then the value is written directly into bytes 08-11. If the value is bigger, then bytes 08-11 contain the offset (counted from the start of the TIFF header) of the byte field in the data area following the IFD.
If the value is smaller than 4 bytes, then it is written in from the left side. For example, a 3-byte value uses bytes 08-10, a 1-byte value byte 08, a 2-byte value bytes 08-09.
The simple rule: if count * bytes per value of the type <= 4 bytes, valueoffset contains the value itself, otherwise it contains an offset.
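
The value-or-offset rule can be written down as a few lines of portable awk. This is only a sketch: the function name value_location and its output format are my own, and the byte sizes per type are the ones from the field type list of the Exif documentation.

```shell
# value_location TYPE COUNT: prints "inline <bytes>" or "offset <bytes>"
# (hypothetical helper illustrating the count * size <= 4 rule)
value_location() {
  awk -v ftype="$1" -v count="$2" '
    BEGIN {
      # bytes per value for each Exif field type
      size[1] = 1; size[2] = 1; size[3] = 2; size[4] = 4
      size[5] = 8; size[7] = 1; size[9] = 4; size[10] = 8
      if (!(ftype in size)) { print "unknown type"; exit 1 }
      bytes = count * size[ftype]
      printf "%s %d\n", (bytes <= 4 ? "inline" : "offset"), bytes
    }'
}

value_location 2 32   # ASCII, count 32 (first field of the question) -> offset 32
value_location 3 1    # one SHORT                                     -> inline 2
```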

Field Types
1 = BYTE An 8-bit unsigned integer
2 = ASCII An 8-bit byte containing one 7-bit ASCII code. The final byte is terminated with NULL
3 = SHORT A 16-bit (2-byte) unsigned integer
4 = LONG A 32-bit (4-byte) unsigned integer
5 = RATIONAL Two LONGs. The first LONG is the numerator and the second LONG expresses the denominator
7 = UNDEFINED An 8-bit byte that can take any value depending on the field definition
9 = SLONG A 32-bit (4-byte) signed integer (2's complement notation)
10 = SRATIONAL Two SLONGs. The first SLONG is the numerator and the second SLONG is the denominator
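
For the signed types (SLONG, SRATIONAL) the 2's complement has to be resolved after the bytes are summed up. A minimal sketch in portable awk, assuming little endian hex bytes as arguments; the name slong_le is my own. It follows the same idea as bytes_TO_number in the script further down: if the most significant byte is >= 128, subtract 256^(number of bytes).

```shell
# slong_le B1 B2 B3 B4: decodes little endian hex bytes as a signed integer
# (hypothetical helper)
slong_le() {
  printf '%s\n' "$*" | awk '
    function hex2dec(h,    i, n) {
      h = tolower(h); n = 0
      for (i = 1; i <= length(h); i++)
        n = n * 16 + index("0123456789abcdef", substr(h, i, 1)) - 1
      return n
    }
    {
      v = 0; weight = 1
      for (i = 1; i <= NF; i++) { b = hex2dec($i); v += b * weight; weight *= 256 }
      if (b >= 128) v -= weight     # b is now the most significant byte: top bit set -> negative
      printf "%d\n", v
    }'
}

slong_le fe ff ff ff   # prints: -2
slong_le 10 00 00 00   # prints: 16
```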

How to find IFDs other than IFD 0 and IFD 1
There are a number of Exif property tags which are pointers to other IFDs.
The value of such a tag is the offset of another IFD.
e.g. tag 0x8769 - Exif IFD, tag 0x8825 - GPS IFD, tag 0xA005 - InterOp IFD
Go to https://exiftool.org/TagNames/EXIF.html and search for the string "-->" in the table, which shows you all pointers.
So you have to search through the field descriptors in IFD 0 for these property tags, and also in the IFDs you are pointed to, to get all IFDs.
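
The pointer search itself is just a lookup of the tag in a small table, plus a check that the target offset was not seen before. A sketch in portable awk: the name find_ifd_pointers, the input format (one "tag value" pair per line, tag in hex) and the sample offsets 240 and 1020 are made up for illustration and are not taken from a real file.

```shell
# find_ifd_pointers: reads "TAG VALUE" lines and prints every new IFD a pointer tag refers to
# (hypothetical helper; only the three pointer tags named above are known to it)
find_ifd_pointers() {
  awk '
    BEGIN { name["8769"] = "Exif IFD"; name["8825"] = "GPS IFD"; name["a005"] = "InterOp IFD" }
    {
      tag = tolower($1)
      if ((tag in name) && !($2 in seen)) {   # pointer tag with an offset not visited yet
        seen[$2] = 1
        print name[tag] " at offset " $2
      }
    }'
}

printf '%s\n' '010e 146' '8769 240' '8825 1020' '8769 240' | find_ifd_pointers
# Exif IFD at offset 240
# GPS IFD at offset 1020
```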

Finally an awk example - it reads IFD 0 and IFD 1 and searches for the GPS, Exif and InterOp IFD tags.
Look at the functions evaluate_EXIF, read_IFD and get_IFD_field_value.

awk -b '
BEGIN {
FS="^$"; RS="\xFF";                              #-- setting Field and Record Separator
for (i=0;i<256;i++) dec[sprintf("%c",i)]=i;      #-- building byte to decimal

#-- building exif_fieldtype_infos [Number Of Bytes]-[Unsigned 0|Signed 1]-[Normal 0|Fraction 1]
exif_fieldtype_infos["\x01"]= "1-0-0";
exif_fieldtype_infos["\x02"]= "1- - ";
exif_fieldtype_infos["\x03"]= "2-0-0";
exif_fieldtype_infos["\x04"]= "4-0-0";
exif_fieldtype_infos["\x05"]= "8-0-1";
exif_fieldtype_infos["\x07"]= "1- - ";
exif_fieldtype_infos["\x09"]= "4-1-0";
exif_fieldtype_infos["\x0A"]= "8-1-1"; 
}

#-- gets the next bytes by loading further records until at least count bytes are collected; the \xFF record separator that awk stripped is put back in front of every loaded record
function get_nextBytes_LR (count)
   {
   nextBytes="";
   while (length(nextBytes)<count && (getline recordBytes)>0) nextBytes= nextBytes RS recordBytes;   #-- RS instead of RT, because RT is empty for the last record of the file
   return nextBytes;
   }

#--- clears leading null bytes and reverses bytes(string of bytes eg. "\x05\x06\x0A"), if reverse is true.
function bytes_clearLNulls (bytes, bigEndian, reverse,     i, l, bytes_reverse)
   {
   l=length(bytes);
   if (bigEndian)
      {
      for (i=1;i<l;i++) if (substr(bytes,i,1)!="\x00") break;
      bytes=substr(bytes,i);
      }
   else
      {
      for (i=l;i>1;i--) if (substr(bytes,i,1)!="\x00") break;
      bytes=substr(bytes,1,i);
      }
   if (reverse)
      {
      l=length(bytes); bytes_reverse="";
      for (i=1;i<=l;i++) bytes_reverse=substr(bytes, i, 1) bytes_reverse;
      return bytes_reverse;
      }
   else return bytes;
   }

#-- converts bytes (string of bytes eg. "\x05\x06\x0A") to a signed or unsigned decimal number
function bytes_TO_number (bytes, bigEndian, signed,      i, l, byte_inDEC, number, weight)
   {
   l=length(bytes); number=0; weight=1;
   if (bigEndian)
      {
      for (i=l;i>0;i--)
         {
         byte_inDEC=dec[substr(bytes, i, 1)];
         number+=byte_inDEC*weight;
         weight*=256;
         }
      }
   else
      {
      for (i=1;i<=l;i++)
         {
         byte_inDEC=dec[substr(bytes, i, 1)];
         number+=byte_inDEC*weight;
         weight*=256;
         }   
      }
   if (signed && byte_inDEC >= 128) number-=weight;
   return number;
   }
   
#-- converts bytes (string of bytes eg. "\x05\x06\x0A") to a HEX string, each byte separated by delimiter
function bytes_TO_hex (bytes, delimiter,     i, l, hexSTR)
   {
   l=length(bytes); hexSTR=""
   for (i=1;i<=l;i++) hexSTR= hexSTR sprintf("%02X", dec[substr(bytes, i, 1)]) delimiter;
   return substr(hexSTR, 1, length(hexSTR)-length(delimiter));
   }  

#-- reads the IFD into the array passed as parameter IFD. You get an associative array with the properties ["fields_L"] .. number of fields, ["fields"] .. fields array, ["next"] .. offset value of the next IFD
function read_IFD (IFD, offset, bytes_ARR, bigEndian,    field_i, fields_l, byte_i)
   {
   #-- byte_i .. byte index (bytes_ARR), field_i .. field index
   IFD["fields_L"]=fields_l=bytes_TO_number(bytes_ARR[offset] bytes_ARR[offset+1], bigEndian, 0);
   byte_i=offset+2;

   for (field_i=1;field_i<=fields_l;field_i++)
      {
      IFD["fields"][field_i]["tag"]=bytes_clearLNulls(bytes_ARR[byte_i] bytes_ARR[byte_i+1], bigEndian, !bigEndian);
      IFD["fields"][field_i]["type"]=bytes_clearLNulls(bytes_ARR[byte_i+2] bytes_ARR[byte_i+3], bigEndian);
      IFD["fields"][field_i]["counter"]=bytes_TO_number(bytes_ARR[byte_i+4] bytes_ARR[byte_i+5] bytes_ARR[byte_i+6] bytes_ARR[byte_i+7], bigEndian);
      IFD["fields"][field_i]["valueoffset"]=bytes_ARR[byte_i+8] bytes_ARR[byte_i+9] bytes_ARR[byte_i+10] bytes_ARR[byte_i+11];     #-- can be value or offset
      byte_i+=12;
      }
   IFD["next"]=bytes_TO_number(bytes_ARR[byte_i] bytes_ARR[byte_i+1] bytes_ARR[byte_i+2] bytes_ARR[byte_i+3], bigEndian);
   }

 #-- returns the value of an IFD field as string - IFD fields are returned by function read_IFD 
function get_IFD_field_value (field, bytes_ARR, bigEndian,    bytes, is_valueoffset_value, field_bytelength, value, counter, byte_i, fieldtype_info, i, word_l, word, signed, fractional, number_delimiter)
   {
   value=""; counter=field["counter"]; is_valueoffset_value=0; number_delimiter=";";
  
   split(exif_fieldtype_infos[field["type"]], fieldtype_info, "-");              #-- get fieldtype info -- [1] number of bytes (word), [2] unsigned|signed, [3] normal|fractional
   word_l=fieldtype_info[1];
   if (fieldtype_info[2]==" ") signed=" "; else signed=strtonum(fieldtype_info[2]);
   if (fieldtype_info[3]==" ") fractional=" "; else fractional=strtonum(fieldtype_info[3]);
   
   field_bytelength=counter*word_l;                                             
   if (field_bytelength<=4)                                                      #-- check, if field["valueoffset"] contains the value itself or an offset value;
      {                                                                          #-- contains the value itself use bytes string
      bytes=substr(field["valueoffset"], 1, field_bytelength);                   #-- cut from the left side til field_bytelength
      is_valueoffset_value=1;
      byte_i=1;
      }
   else byte_i=bytes_TO_number(field["valueoffset"], bigEndian);                 #-- field["valueoffset"] contains offset value, use bytes_ARR

   #-- i .. word byte index, byte_i .. byte index (bytes_ARR) or index of bytes String, counter .. word counter
   for (;counter>0;counter--)
      {
      word="";
      if (is_valueoffset_value) { word=substr(bytes, byte_i, word_l); byte_i+=word_l; }      #-- building word from bytes of field["valueoffset"]
      else for (i=1; i<=word_l; i++) { word=word bytes_ARR [byte_i]; byte_i++; }             #-- building word from bytes_ARR
      
      if (signed==" ") value= value word;  #-- -> STRING - word=character, character is added to string
      else                                 #-- -> NUMBER - if more numbers per field (counter > 1), numbers are separated with ";" - look at number_delimiter
         {
         if (counter==1) number_delimiter="";
         if (fractional) value= value bytes_TO_number(substr(word, 1, word_l/2), bigEndian, signed) "/" bytes_TO_number(substr(word, word_l/2+1), bigEndian, signed) number_delimiter;
         else value= value bytes_TO_number(word, bigEndian, signed) number_delimiter;
         }
      }
   return value;
   }

function evaluate_EXIF (data,    j, found_IFDs, IFD, IFD_offsets, ifd_name, bigEndian, bytes_ARR, fields_i, fields_L, tag, value, unknown)
    {
    unknown=1;                                                                                  #-- counter for unknown IFDs
    
    if (substr(data,7,2)=="MM") bigEndian=1;                                                    #-- EXIF offset 0006 check for Big Endian
    else bigEndian=0;

    if (dec[bytes_clearLNulls(substr(data,9,2), bigEndian)] == 42)                              #-- EXIF offset 0008 check for TIFF marker number 42
       {
       #-- build bytes array, so that offset values from IFDs can be used directly as array indexes. awk arrays and strings start with index 1, not with 0.
       #-- the TIFF header starts at string position 7 (EXIF offset 0006), so splitting from string position 8 makes array index n hold the byte at TIFF offset n (TIFF offset 0 is not stored and is never needed here)
       split(substr(data, 8), bytes_ARR,"");
            
      IFD_offsets[1]=bytes_TO_number(substr(data, 11, 4), bigEndian)                           #-- EXIF offset 000A - offset value of the 0th ifd
      #-- associative array with IFD offset as key and IFD name as value
      found_IFDs[IFD_offsets[1]]="IFD 0";       
      
      for (j=1; j<=length(IFD_offsets); j++)
         {
         read_IFD(IFD, IFD_offsets[j], bytes_ARR, bigEndian);

         fields_L=IFD["fields_L"];
         for (fields_i=1; fields_i<=fields_L; fields_i++)
            {
            tag=IFD["fields"][fields_i]["tag"]; value=get_IFD_field_value(IFD["fields"][fields_i], bytes_ARR, bigEndian);
            
            #-- search for ifd pointers
            switch (tag)
               {
               case "\x87\x69": if (!(value in found_IFDs)) { IFD_offsets[length(IFD_offsets)+1]=value; found_IFDs[value]="Exif IFD";} break;
               case "\x88\x25": if (!(value in found_IFDs)) { IFD_offsets[length(IFD_offsets)+1]=value; found_IFDs[value]="GPS IFD";} break;
               case "\xA0\x05": if (!(value in found_IFDs)) { IFD_offsets[length(IFD_offsets)+1]=value; found_IFDs[value]="InterOP IFD";} break;
               }
               
            print found_IFDs[IFD_offsets[j]] " - " bytes_TO_hex(tag) " = " value;
            }
         
         if (IFD["next"]>0 && !(IFD["next"] in found_IFDs))
            {
            if (found_IFDs[IFD_offsets[j]]=="IFD 0") ifd_name="IFD 1";
            else { ifd_name="unknown " unknown++; }

            IFD_offsets[length(IFD_offsets)+1]=IFD["next"];
            found_IFDs[IFD["next"]]=ifd_name;
            }
         delete IFD;
         }
       }
    }

{            
if (length($0)==0) next;   #-- because JPEG file starts with \xFF, the first record of awk (record splitting "\xFF") is an empty string

jpeg_marker=substr ($0,1,1);

switch (jpeg_marker)
   {
   case "\xD8": break;        #-- \xD8 Start Of Image - no segment following
   case "\xDA": exit;         #-- \xDA Start Of Scan  - short info segment and compressed image data following
   
   #-- segment following - getting the segment size and get bytes til end of segment
   default:
      if (length($0)<3) $0=$0 get_nextBytes_LR(2);                    #-- can happen when the segment size bytes following the jpeg_marker contain FF values
      segment_byte1=substr($0,2,1); segment_byte2=substr($0,3,1);     #-- get the two segment size bytes
      segment_size=dec[segment_byte1]*256 +dec[segment_byte2];        #-- segment size (high byte first), includes the two size bytes
         
      segment_missing_bytecount=segment_size-(length($0)-1);
      segment_bytes_postloaded=get_nextBytes_LR(segment_missing_bytecount);    #-- load missing bytes for segment
      
      if (length(segment_bytes_postloaded) < segment_missing_bytecount)
         {
         print "JPEG error: image corrupt - end of image reached, marker segment size exceeds image size!" >"/dev/stderr";
         exit;
         }
      if (length(segment_bytes_postloaded)>segment_missing_bytecount)
         {               
         print "JPEG error: image corrupt - end of segment is not followed by the next marker FF..!" >"/dev/stderr";
         exit;
         }
      
      #-- if parser is here -> segment was read without faults

      if (jpeg_marker=="\xE1")
         {
         segment=substr($0 segment_bytes_postloaded, 4,segment_size-2); #--build segment
         evaluate_EXIF(segment);
         exit;
         }        
   }
}' "$1" #-- $1 your JPEG image (quoted, so that file names with spaces work)