I'm looking at the source of a third-party HTML sanitizer. After parsing the HTML into a DOM tree, the code does two things:
- Deletes all elements and attributes not on the whitelist
- Encodes all attribute values with UrlPathEncode
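
To make the two steps concrete, here is a minimal sketch of what I understand the sanitizer to be doing to each element's attributes. It is not the library's code: the whitelist contents and the name sanitize_attrs are made up for illustration, and Python's urllib.parse.quote is only a rough stand-in for UrlPathEncode (the two do not escape exactly the same set of characters).

```python
from urllib.parse import quote

# Hypothetical whitelist; the real sanitizer's list is not shown here.
ALLOWED_ATTRS = {"href", "src", "alt", "title"}

def sanitize_attrs(attrs):
    """Drop non-whitelisted attributes, then URL-encode every remaining value."""
    # Step 1: keep only attributes whose name is on the whitelist.
    kept = {name: value for name, value in attrs.items()
            if name.lower() in ALLOWED_ATTRS}
    # Step 2: encode every surviving value, regardless of which attribute it is.
    # quote() is an approximation of UrlPathEncode, not an exact equivalent.
    return {name: quote(value) for name, value in kept.items()}

cleaned = sanitize_attrs({"href": "/a b", "onclick": "alert(1)", "title": "Read more"})
print(cleaned)
```

The onclick attribute is gone after step 1, which is why step 2 looks redundant to me as far as event handlers are concerned.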
What is the latter for? What kind of attack is it meant to prevent? Some flavor of XSS most likely, but through which pathway? Sneaking JavaScript into event handler attributes is already prevented by the whitelist, isn't it?
Meanwhile, unconditionally URL-encoding every attribute will mangle user-visible text such as alt on images and title on links. The browser doesn't URL-decode those; I've checked.
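
A small illustration of the mangling, again using Python's urllib.parse.quote as an approximate stand-in for UrlPathEncode (an assumption; the exact characters escaped may differ). The sample alt text is made up.

```python
from urllib.parse import quote, unquote

alt_text = "A cat & a dog"   # hypothetical alt text for an image

# What ends up in the attribute after unconditional URL-encoding.
encoded = quote(alt_text)
print(encoded)               # A%20cat%20%26%20a%20dog

# The original text only comes back if something URL-decodes it,
# which the browser does not do for alt or title.
assert unquote(encoded) == alt_text
```

So a tooltip or alt text would be shown to the user with the percent escapes in place.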