Decoding a Thrift object: what are these extra bytes?


I'm working on writing a pure JS thrift decoder that doesn't depend on thrift definitions. I have been following this handy guide which has been my bible for the past few days: https://erikvanoosten.github.io/thrift-missing-specification/

I almost have my parser working, but there is a string type that throws a wrench into the program, and I don't quite understand what it's doing. Here is an excerpt of the hexdump, which I did my best to annotate:

Correctly parsing:

000001a0  0a 32 30 32 31 2d 31 31  2d 32 34 16 02 00 18 07  |.2021-11-24.....|
........................blah blah blah............|  |  |
                                       Object End-|  |  |
                           0x18 & 0xF = 0x8 = Binary-|  |
             The binary sequence is 0x7 characters long-|
000001b0  53 65 61 74 74 6c 65 18  02 55 53 18 02 55 53 18  |Seattle..US..US.|
          S  E  A  T  T  L  E  |___|  U  S  |___| U  S
    Another string, 2 bytes long |------------|

So far so good.

But then I get to this point. The string I am trying to extract is "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4592.0 Safari/537.36 Edg/94.0.975.1", which is 134 bytes long.

000001c0  09 54 61 68 6f 65 2c 20  43 41 12 12 00 00 08 c8  |.Tahoe, CA......|
                                 Object ends here-|  |  |
                           0x8 & 0xF = 0x8 = Binary -|  |
                                  0xc8 bytes long (200)-|
000001d0  01 86 01 4d 6f 7a 69 6c  6c 61 2f 35 2e 30 20 28  |...Mozilla/5.0 (|
          |  |  |  M  o  z  i  l   l  a  
        ???? |--|-134, encoded as var-int
000001e0  4d 61 63 69 6e 74 6f 73  68 3b 20 49 6e 74 65 6c  |Macintosh; Intel|

As you can see, the byte sequence 0x08 0xC8 0x01 0x86 0x01 contains the length of the string I'm looking for and is followed by that string, but it also has 3 extra bytes whose purpose is unclear.

The 0x01 is especially confusing, as it is neither a type identifier nor does it seem to hold a concrete value.

What am I missing?

There are 2 best solutions below

Slava Knyazev (best answer)

The byte sequence reads as follows

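  • 0x08: Field header. The low four bits (0x8) are the binary/string type; the high four bits are 0, so the field ID is not packed into this byte and follows separately
  • 0xC8 0x01: Field ID (a 16-bit value), written as a zigzag var int; the var int value 200 decodes to field ID 100
  • 0x86 0x01: String length (134), encoded as var int

It turns out that if the high four bits of the type identifier byte do not carry a field ID delta, the field ID is written out in full right after it. In the compact protocol that ID is a zigzag var int, which happens to take 2 bytes here.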
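The breakdown above can be checked with a small sketch. It assumes the compact protocol field header layout described in the guide linked in the question (delta in the high four bits, type in the low four bits, and a zigzag var int field ID when the delta is 0). The function names `readVarint`, `zigzagDecode` and `readFieldHeader` are made up for illustration and are not part of any Thrift library.

```javascript
// Read an unsigned varint starting at `offset`.
// Returns the decoded value and the offset of the next unread byte.
function readVarint(bytes, offset) {
  let value = 0;
  let shift = 0;
  let pos = offset;
  while (true) {
    if (pos >= bytes.length) throw new RangeError("truncated varint");
    const b = bytes[pos++];
    value += (b & 0x7f) * 2 ** shift; // low 7 bits carry payload
    shift += 7;
    if ((b & 0x80) === 0) break;      // high bit clear: last byte
  }
  return { value, next: pos };
}

// Undo zigzag encoding (only valid for values that fit in 32 bits).
function zigzagDecode(n) {
  return (n >>> 1) ^ -(n & 1);
}

// Read one field header. `lastFieldId` is the ID of the previous field
// in the same struct (0 at the start of a struct).
function readFieldHeader(bytes, offset, lastFieldId) {
  const head = bytes[offset];
  if (head === 0x00) return { stop: true, next: offset + 1 };
  const type = head & 0x0f;
  const delta = head >> 4;
  if (delta !== 0) {
    // Short form: the field ID is the previous ID plus the delta.
    return { stop: false, type, fieldId: lastFieldId + delta, next: offset + 1 };
  }
  // Long form: the field ID follows as a zigzag varint.
  const id = readVarint(bytes, offset + 1);
  return { stop: false, type, fieldId: zigzagDecode(id.value), next: id.next };
}

// The five bytes from the question.
const sample = Uint8Array.from([0x08, 0xc8, 0x01, 0x86, 0x01]);
const header = readFieldHeader(sample, 0, 0);
const length = readVarint(sample, header.next);
console.log(header.type, header.fieldId, length.value); // 8 100 134
```

The same `readFieldHeader` also handles the earlier `0x18` bytes from the hexdump: the high four bits are 1, so the field ID is simply the previous field's ID plus 1 and no extra bytes follow.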

codeSF

Thrift supports pluggable serialization schemes. In tree you have binary, compact and json. Out of tree anything goes. From the looks of it you are trying to decode compact protocol, so I'll answer accordingly.

Everything sent and everything returned in a Thrift RPC call is packaged in a struct. Every field in a struct has a 1 byte type and a 2 byte field ID prefix. In compact protocol field IDs, when possible, are delta encoded into the type byte, and all ints are compressed down to just the bits needed to store them (and some flags). Because ints can now take up varying numbers of bytes, we need to know when they end. Compact protocol encodes the int bits in the low 7 bits of each byte and sets the high order bit to 1 if the next byte continues the int. If the high order bit is 0, the int is complete. Thus a var int with the value 5 (binary 101) would be encoded in one byte as 00000101. Compact knows this is the end of the int because the high order bit is 0.

In your case, the int 134 (binary 10000110) needs 2 bytes to encode because it is more than 7 bits. The low-order 7 bits (0000110) are stored in byte 1 with the 0x80 bit set to flag "the int continues", which gives 0x86. The second and final byte encodes the remaining high bit (00000001). What you thought was 134 was just the encoding of the first seven bits. The stray 1 was the final bit of the 134.
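To make the arithmetic concrete, here is a minimal sketch of var int encoding and decoding for non-negative integers. The names `encodeVarint` and `decodeVarint` are illustrative only and do not come from the Thrift library; signed values, which the compact protocol zigzag-encodes first, are not handled here.

```javascript
// Encode a non-negative integer as a varint: 7 payload bits per byte,
// least significant group first, high bit set on every byte but the last.
function encodeVarint(n) {
  const out = [];
  while (n > 0x7f) {
    out.push((n & 0x7f) | 0x80); // low 7 bits plus the "continues" flag
    n = Math.floor(n / 128);     // drop the 7 bits just written
  }
  out.push(n);                   // final byte, high bit clear
  return out;
}

// Decode a varint from the start of an array of bytes.
function decodeVarint(bytes) {
  let value = 0;
  let shift = 0;
  for (const b of bytes) {
    value += (b & 0x7f) * 2 ** shift;
    shift += 7;
    if ((b & 0x80) === 0) return value; // high bit clear: done
  }
  throw new RangeError("varint ended without a terminating byte");
}

console.log(encodeVarint(134)); // the two bytes 0x86, 0x01
console.log(encodeVarint(5));   // the single byte 0x05
console.log(decodeVarint([0x86, 0x01])); // 134
```

Note that 0x86 read as a plain byte is also 134, which is a coincidence of this particular value: the continuation flag (0x80) plus the low bits (0x06) add up to the same number as the original int.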

I'd recommend you use the in tree source to do any needed protocol encoding/decoding. It's already written and tested: https://github.com/apache/thrift/blob/master/lib/nodejs/lib/thrift/compact_protocol.js