https://github.com/itext/itext7/blob/develop/CONTRIBUTING.md says this is the place to report itext7 bugs, so there you go.
Observable behaviour
Using itext7 version 8.0.0
My PDF document includes several instances of /Span << /ActualText <> >>
<> is valid syntax for a hex-encoded, zero-length PdfString; however, it is parsed into a PdfString instance with content of 128 zero bytes, hexWriting=true, and value of an empty string.
For this instance, GetValue() correctly returns the empty string, but ToUnicodeString() essentially calls PdfTokenizer.DecodeStringContent(new byte[128], true), which returns a byte[64] with every element set to 239 (0xEF). That array is then converted into a string of 64 ï characters, which is what ToUnicodeString() returns.
Because CanvasTag.GetActualText calls ToUnicodeString(), it uses the corrupted 64-character string instead of the empty string.
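The value 239 can be reproduced by hand. The sketch below is not iText code. It assumes that the hex decoder maps every non-hex byte to -1 and combines two nibbles as (hi << 4) + lo without validation, which is consistent with the output observed above: a pair of zero bytes becomes (-1 << 4) + (-1) = -17, which truncates to 0xEF. HexDigit and DecodeHex are hypothetical names used only for illustration.

```csharp
using System;
using System.Linq;

static class HexDecodeSketch {
    // Hypothetical stand-in for the hex-digit lookup:
    // returns -1 for anything that is not 0-9, A-F or a-f (assumed behaviour).
    static int HexDigit(int v) {
        if (v >= '0' && v <= '9') return v - '0';
        if (v >= 'A' && v <= 'F') return v - 'A' + 10;
        if (v >= 'a' && v <= 'f') return v - 'a' + 10;
        return -1;
    }

    // Decodes pairs of bytes as hex digits without validating them.
    // Only even-length input is handled in this sketch.
    public static byte[] DecodeHex(byte[] content) {
        var result = new byte[content.Length / 2];
        for (int i = 0; i < result.Length; i++) {
            int hi = HexDigit(content[2 * i]);
            int lo = HexDigit(content[2 * i + 1]);
            // For two zero bytes: (-1 << 4) + (-1) = -17, truncated to 0xEF.
            result[i] = unchecked((byte)((hi << 4) + lo));
        }
        return result;
    }

    public static void Main() {
        byte[] decoded = DecodeHex(new byte[128]);
        Console.WriteLine(decoded.Length); // 64
        Console.WriteLine(decoded[0]);     // 239

        // Mapping each byte to the code point with the same value gives U+00EF, i.e. 'ï'.
        string text = new string(decoded.Select(b => (char)b).ToArray());
        Console.WriteLine(text.Length);    // 64
    }
}
```

If this assumption holds, the 64 ï characters are simply 128 zero bytes misread as 64 pairs of hex digits.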
Reproducer
Using C# immediate window
var str = new iText.Kernel.Pdf.PdfString("");
Expression has been evaluated and has no value
str.SetHexWriting(true);
{}
content: null
decryptInfoGen: 0
decryptInfoNum: 0
decryption: null
directOnly: false
encoding: null
hexWriting: true
indirectReference: null
state: 0
value: ""
str.ToUnicodeString()
"ïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïï"
Root cause
protected internal virtual byte[] EncodeBytes(byte[] bytes) {
    if (hexWriting) {
        ByteBuffer buf = new ByteBuffer(bytes.Length * 2);
        foreach (byte b in bytes) {
            buf.AppendHex(b);
        }
        return buf.GetInternalBuffer();
    }
    // ... (rest of the method omitted)
}
This creates, for a zero-length PdfString, a new ByteBuffer(0).
In https://github.com/itext/itext7-dotnet/blob/develop/itext/itext.io/itext/io/source/ByteBuffer.cs#L39 :
public ByteBuffer(int size) {
    if (size < 1) {
        size = 128;
    }
    buffer = new byte[size];
}
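This means that, for a zero-length PdfString, the created buffer is a byte[128]. Since nothing is appended to it, GetInternalBuffer() returns those 128 zero bytes as the encoded content.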
Suggested fix
Adding if (bytes.Length == 0) return bytes; into EncodeBytes would do just fine.
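To make the effect of the guard concrete, here is a self-contained sketch, not the iText source. MiniByteBuffer is a hypothetical stand-in that models only the behaviour described in this report: the 128-byte fallback in the constructor, and GetInternalBuffer() returning the whole backing array. Its hex formatting and the guard parameter are assumptions added for the demonstration.

```csharp
using System;

// Hypothetical stand-in for ByteBuffer; only the behaviour described in this report is modelled.
class MiniByteBuffer {
    private readonly byte[] buffer;
    private int count;

    public MiniByteBuffer(int size) {
        if (size < 1) {
            size = 128; // same fallback as in the constructor quoted above
        }
        buffer = new byte[size];
    }

    // Appends the two lowercase hex digits of b (formatting is an assumption).
    public void AppendHex(byte b) {
        const string digits = "0123456789abcdef";
        buffer[count++] = (byte)digits[b >> 4];
        buffer[count++] = (byte)digits[b & 0x0f];
    }

    // Returns the whole backing array, including unused capacity.
    public byte[] GetInternalBuffer() {
        return buffer;
    }
}

static class EncodeSketch {
    // Mirrors the hex branch of EncodeBytes; 'guard' toggles the proposed early return.
    public static byte[] EncodeHex(byte[] bytes, bool guard) {
        if (guard && bytes.Length == 0) {
            return bytes; // proposed fix: nothing to encode
        }
        var buf = new MiniByteBuffer(bytes.Length * 2);
        foreach (byte b in bytes) {
            buf.AppendHex(b);
        }
        return buf.GetInternalBuffer();
    }

    public static void Main() {
        Console.WriteLine(EncodeHex(new byte[0], guard: false).Length);        // 128
        Console.WriteLine(EncodeHex(new byte[0], guard: true).Length);         // 0
        Console.WriteLine(EncodeHex(new byte[] { 0xEF }, guard: true).Length); // 2
    }
}
```

Non-empty input is unaffected by the guard, because the buffer is then sized exactly and filled completely.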
Unfortunately, no workaround is possible without modifying itext7 core.
I'm not posting this as a PR because my PRs at https://github.com/itext/i7n-pdfocr/pulls haven't received any attention since 2020. Hopefully, shaped as a bug report, it may get a bit more attention from the itext7 team.