Strange encoding of a pdf stream

209 views. Asked by Teo7 on 31 March 2023 at 16:20

I'm studying the internal structure of PDF, so I created a file in LibreOffice Writer containing only the string "Hello world" and exported it to PDF. Then I uncompressed it with: pdftk hello_world.pdf output hello_world_unc.pdf uncompress and opened the result in a text editor.

Analyzing the stream I get something strange like this: [<01>5<02>-6<03>2<03>2<040506>-2 <040703>2<08>]TJ which should represent "Hello world" as an array of hexadecimal strings (in the angle brackets), and integers to specify the spacing.

To be clear, the file contains only this string; I created it purely for learning purposes.

The problem is that these don't look like the hexadecimal character codes I expected. Surely the "H" is not represented by 01. I was expecting something like this: (Hello world) Tj.

Can anyone help me understand? Thanks in advance


There are 2 best solutions below

the busybee On 31 March 2023 at 17:41 BEST ANSWER

These numbers are just indexes into the character map.

Look deeper into the uncompressed PDF and you will find lines like these, each mapping a code used in the stream (left) to a Unicode value in hexadecimal (right):

<01> <0048>
<02> <0065>
<03> <006C>
<04> <006F>
<05> <0020>
<06> <0077>
<07> <0072>
<08> <0064>
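
As a rough illustration of how the lookup works, here is a minimal Python sketch that decodes the TJ array from the question using the eight mappings listed above. The function name decode_tj is made up for this example, and the sketch assumes every character code is exactly one byte long, which fits this file but is not true of every PDF font.

```python
import re

# Mapping copied from the lines above: character code -> Unicode value.
TO_UNICODE = {
    0x01: 0x0048, 0x02: 0x0065, 0x03: 0x006C, 0x04: 0x006F,
    0x05: 0x0020, 0x06: 0x0077, 0x07: 0x0072, 0x08: 0x0064,
}

# The array from the question, without the trailing TJ operator.
TJ_ARRAY = "[<01>5<02>-6<03>2<03>2<040506>-2 <040703>2<08>]"


def decode_tj(array_text, mapping):
    """Return the text shown by a TJ array, ignoring the kerning numbers."""
    chars = []
    for hex_string in re.findall(r"<([0-9A-Fa-f]+)>", array_text):
        # A PDF hex string with an odd number of digits is padded with 0.
        if len(hex_string) % 2:
            hex_string += "0"
        # Assumption: one byte per character code (hypothetical simplification).
        for code in bytes.fromhex(hex_string):
            chars.append(chr(mapping[code]))
    return "".join(chars)


print(decode_tj(TJ_ARRAY, TO_UNICODE))  # prints "Hello world"
```

So <01> is "H", <02> is "e", <03> is "l", and the whole array spells out the expected string.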

johnwhitington On 31 March 2023 at 18:17

Kerning is in use, so a TJ array is being used instead of a Tj string. The numbers are kerns measured in 1/1000 of an em (from memory).
The <> strings are PDF hex strings, not ordinary PDF strings.
Look for a /ToUnicode map in the font. If it exists, it will help you with the mapping from PDF character codes to sequences of Unicode code points.
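
To make the /ToUnicode suggestion concrete, here is a minimal Python sketch that reads the mapping out of a CMap's bfchar section. The CMAP_TEXT excerpt is a hypothetical reconstruction (only the eight mapping lines come from this thread; the surrounding beginbfchar/endbfchar keywords are what such a section normally looks like), and parse_bfchar is a name invented for the example. It ignores bfrange sections, which a real CMap may also contain.

```python
import re

# Hypothetical excerpt of an uncompressed /ToUnicode CMap stream.
CMAP_TEXT = """
8 beginbfchar
<01> <0048>
<02> <0065>
<03> <006C>
<04> <006F>
<05> <0020>
<06> <0077>
<07> <0072>
<08> <0064>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Map each source code (int) to the Unicode string it stands for."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block)
        for src, dst in pairs:
            # The destination is UTF-16BE and may hold several code units.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


mapping = parse_bfchar(CMAP_TEXT)
print(mapping[0x01], mapping[0x08])  # prints "H d"
```

Decoding the destination as UTF-16BE rather than as a single number keeps the sketch usable for entries that map one code to several characters, such as ligatures.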
