Strange encoding of a PDF stream


I'm studying the internal structure of PDF, so I created a file in LibreOffice Writer containing only the string "Hello world" and exported it to PDF. I then uncompressed it with: pdftk hello_world.pdf output hello_world_unc.pdf uncompress and opened the result in a text editor.

Analyzing the stream I get something strange like this: [<01>5<02>-6<03>2<03>2<040506>-2 <040703>2<08>]TJ which should represent "Hello world" as an array of hexadecimal strings (in the angle brackets), and integers to specify the spacing.

To be clear, the file contains only this string; I created it purely for educational purposes.

The problem is that the hex values don't look like the character codes I expected. Surely "H" is not represented by 01. I was expecting something like this: (Hello world) Tj.

Can anyone help me understand? Thanks in advance

the busybee On BEST ANSWER

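These numbers are not ASCII codes; they are just indexes into the character map of the embedded font.

Look further into the uncompressed PDF and you will find lines like these (typically inside the font's /ToUnicode CMap), which map each code to a Unicode code point: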

<01> <0048>
<02> <0065>
<03> <006C>
<04> <006F>
<05> <0020>
<06> <0077>
<07> <0072>
<08> <0064>
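
To illustrate how this table resolves the array from the question, here is a minimal Python sketch. The names TO_UNICODE and decode_tj are made up for this example, and it assumes one-byte character codes, as in this particular file; other fonts may use two-byte codes.

```python
import re

# Mapping copied from the lines above: font character code -> Unicode code point.
TO_UNICODE = {
    0x01: 0x0048,  # H
    0x02: 0x0065,  # e
    0x03: 0x006C,  # l
    0x04: 0x006F,  # o
    0x05: 0x0020,  # space
    0x06: 0x0077,  # w
    0x07: 0x0072,  # r
    0x08: 0x0064,  # d
}


def decode_tj(array_text, to_unicode):
    """Decode the hex strings of a TJ array, ignoring the numeric adjustments.

    Assumption: every character code is exactly one byte (two hex digits).
    """
    chars = []
    for hex_run in re.findall(r"<([0-9A-Fa-f\s]*)>", array_text):
        digits = re.sub(r"\s+", "", hex_run)
        if len(digits) % 2:
            digits += "0"  # PDF hex strings: a missing final digit counts as 0
        for i in range(0, len(digits), 2):
            code = int(digits[i:i + 2], 16)
            chars.append(chr(to_unicode[code]))
    return "".join(chars)


text = decode_tj("[<01>5<02>-6<03>2<03>2<040506>-2 <040703>2<08>]", TO_UNICODE)
print(text)  # Hello world
```

The numbers between the hex strings are skipped here because they only adjust glyph positions and carry no text.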
johnwhitington On
  • Kerning is in use, so a TJ array is being used instead of a Tj string. The numbers are kerns measured in 1/1000 of an em (from memory);

  • The <> strings are PDF hex strings, not ordinary (parenthesized) PDF strings;

  • Look for a /ToUnicode map in the font. If it exists, it will help you with the mapping from the font's character codes to sequences of Unicode code points.
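
The points above can be sketched in Python as follows. This is only an illustration under stated assumptions: CMAP_EXCERPT is a hypothetical excerpt of a /ToUnicode stream (its values are taken from the first answer), the function names are invented, only beginbfchar sections are handled (not beginbfrange), and the font size of 12 is assumed rather than read from the file.

```python
import re

# Hypothetical excerpt of a /ToUnicode CMap stream; a real one has more boilerplate.
CMAP_EXCERPT = """
8 beginbfchar
<01> <0048>
<02> <0065>
<03> <006C>
<04> <006F>
<05> <0020>
<06> <0077>
<07> <0072>
<08> <0064>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Build {character code: Unicode string} from beginbfchar sections only."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # The destination is UTF-16BE and may encode more than one character.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def tj_adjustments(array_text):
    """Return the numeric entries of a TJ array, skipping the hex strings."""
    without_strings = re.sub(r"<[^>]*>", " ", array_text)
    return [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", without_strings)]


def kern_shift(adjustment, font_size):
    """Horizontal shift in text-space units caused by one TJ number.

    The number is in thousandths of a text-space unit and is subtracted from
    the current position, so a positive value moves the next glyph left
    (horizontal writing, horizontal scaling ignored).
    """
    return -adjustment / 1000 * font_size


mapping = parse_bfchar(CMAP_EXCERPT)
kerns = tj_adjustments("[<01>5<02>-6<03>2<03>2<040506>-2 <040703>2<08>]")
shifts = [kern_shift(k, 12.0) for k in kerns]  # 12.0 is an assumed font size
print(mapping[0x01], kerns)
```

Because the shifts are small fractions of the font size, they only fine-tune letter spacing; the text itself comes entirely from the hex strings and the mapping.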