I plan on using HTML Purifier for the output of my web service. I did not see any built-in "logging" functionality to check what gets replaced, so I wrote it myself.
However, the purify() method automatically transforms my special characters and entities.
For example:
& -> &amp;
&oslash; -> ø
The problem is that these changes are also "logged", because my logging function compares the purified string with the original one. Is there a way to avoid this automatic encoding/decoding, or does anyone have a better idea of how to check what is actually replaced?
Thank you!
The two examples you cite are actually two different cases: the first happens because HTML Purifier is making your output safe (& -> &amp;); the second is HTML Purifier using UTF-8 instead of entities, because that's its internal representation.
Generally speaking, if your HTML is safe, HTML Purifier will output semantically equivalent HTML, but it is not guaranteed to preserve e.g. all whitespace or the exact representation, because its focus is entirely on security, not idempotence for safe HTML, and it transforms incoming HTML quite heavily in the interest of thorough analysis.
You could force it to always turn all non-ASCII characters into entities with Core.EscapeNonASCIICharacters, but I doubt that's what you want: it will also change any UTF-8 character that's not currently an entity into an entity. It also doesn't change the fact that unescaped HTML special characters will be escaped (& -> &amp;). HTML Purifier doesn't take chances, so even those HTML special characters that are coincidentally/contextually safe will always be encoded.
Instead, take a look at Core.CollectErrors. That should enable checking for the changes that you're looking for. Despite the warning in the docs, it is a solid feature. You can see an example usage of that feature here. The tl;dr is that to get the error collector, you use $purifier->context->get('ErrorCollector'); and to get your list of errors (which includes replacements), you call $errorCollector->getRaw(). Try that and see if it works?
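For reference, here is a minimal sketch of how that could be wired up. It assumes HTML Purifier was installed through Composer (the autoload path is a guess about your setup), and the sample input string is made up for illustration. The numeric layout of each raw entry is read via the HTMLPurifier_ErrorCollector constants; treat the exact messages you get back as something to inspect rather than rely on.

```php
<?php
// Assumption: HTML Purifier is available through Composer's autoloader.
// Adjust this path (or require HTMLPurifier.auto.php instead) for your setup.
require_once __DIR__ . '/vendor/autoload.php';

$config = HTMLPurifier_Config::createDefault();
// Ask HTML Purifier to record what it changes while purifying.
$config->set('Core.CollectErrors', true);

$purifier = new HTMLPurifier($config);

// Hypothetical input, only for illustration.
$dirty = '<p onclick="evil()">Tom & Jerry</p><script>alert(1)</script>';
$clean = $purifier->purify($dirty);

// The collector is stored in the purifier's context after purify() has run.
$errorCollector = $purifier->context->get('ErrorCollector');
$errors = $errorCollector->getRaw();

foreach ($errors as $error) {
    // Each raw entry is a numerically indexed array; the class constants
    // name the positions of the line number and the message.
    $line    = $error[HTMLPurifier_ErrorCollector::LINENO];
    $message = $error[HTMLPurifier_ErrorCollector::MESSAGE];
    error_log(sprintf('HTML Purifier (line %s): %s', $line ?? '?', $message));
}
```

Because the log is built from what the purifier reports rather than from a string diff, harmless re-encodings such as & -> &amp; should no longer need to be filtered out by hand, although it is worth checking on your own input which entries actually show up.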