Best Practices when outputting html data from database to browser


I store HTML data in a database.

The HTML data is very simple, and is generated by a WYSIWYG editor.

Before I store the HTML data in the database, I run it through HTMLPurifier to remove any badness.

When I output the data back out to the browser, because it is HTML data, obviously I cannot use PHP's htmlspecialchars().

I am wondering if there are any problems with this as far as XSS attacks are concerned. Is passing the data through HTMLPurifier before saving in the database enough? Are there any things I am missing / other steps I should be taking?

Thanks (in advance) for your help.


There are 3 best solutions below

1
On

I've never had an issue with mainstream richtext editors.

XSS happens when people are able to embed raw HTML into your page through web forms whose input you output at a later date (so always encode user input when writing it to the screen).

This can't happen with a (good) text editor. If a user types in HTML code (e.g. < or >), the text editor will encode it anyway. The only tags it will create are its own.

1
On

There is a function, htmlspecialchars(), that encodes special characters as their HTML entities. For example, < becomes &lt;
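To make the encoding concrete, here is a minimal sketch of the same idea in JavaScript. The name `escapeHtml` is hypothetical (it is not a built-in), and the entity table is an assumption that mirrors what htmlspecialchars() converts when both quote styles are encoded.

```javascript
// Hypothetical helper: a rough JavaScript counterpart to PHP's htmlspecialchars().
// Assumption: both double and single quotes are encoded (as with ENT_QUOTES).
function escapeHtml(text) {
    const entities = {
        "&": "&amp;",
        "<": "&lt;",
        ">": "&gt;",
        '"': "&quot;",
        "'": "&#039;"
    };
    // Replace each special character with its entity in a single pass,
    // so an "&" produced by one replacement is never encoded twice.
    return String(text).replace(/[&<>"']/g, (ch) => entities[ch]);
}

console.log(escapeHtml('<b>"Tom" & Jerry</b>'));
// prints: &lt;b&gt;&quot;Tom&quot; &amp; Jerry&lt;/b&gt;
```

Note that this turns every tag into visible text, so it suits plain-text fields rather than HTML that is meant to render.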

In addition, you may want to clean out any suspicious tags. I wrote a short JS function a while ago to do this for a project (by no means all-inclusive!). You may want to take this and edit it for your needs, or base your own off of it...

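    <script language="javascript" type="text/javascript">

    // Click handler: reads the input box, filters it, writes the result to the output box.
    function Button1_onclick() {
        //get text
        var text = document.getElementById("txtIn").value;
        //wype it
        text = wype(text);
        //give it back
        document.getElementById("txtOut").value = text;
    }

    // Runs both filters: whole script blocks first, then individual tags.
    function wype(text) {
        text = script(text);
        text = regex(text);
        return text;
    }

    // Removes <script>...</script> blocks, including multi-line and upper-case ones.
    // The closing tag is split so the HTML parser does not end this block early.
    function script(text) {
        var re1 = new RegExp('<script[^>]*>[\\s\\S]*?</scri' + 'pt\\s*>', 'gi');
        text = text.replace(re1, '');
        return text;
    }

    // Removes opening and closing tags from the list, leaving the text between them.
    function regex(text) {
        var tags = ["html", "body", "head", "!doctype", "script", "embed", "object", "frameset", "frame", "iframe", "meta", "link", "div", "title", "w", "m", "o", "xml"];
        for (var x = 0; x < tags.length; x++) {
            var tag = tags[x];
            // The lookahead stops "o" from also matching <ol>, or "head" matching <header>.
            var re = new RegExp('<\\/?' + tag + '(?=[\\s:/>])[^><]*>', 'gi');
            text = text.replace(re, '');
        }
        return text;
    }
    </script>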
0
On

What you are doing is correct. You may also consider filtering on the way out, just to be sure. You mentioned you are using HTMLPurifier, which is great. Just don't ever try to implement a sanitizer on your own; there are lots of pitfalls in that approach.
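As an illustration of two such pitfalls, here is a small sketch. The function `naiveStripScripts` is hypothetical and deliberately simplistic; it stands in for a typical hand-rolled, single-pass regex filter and is not how HTMLPurifier works.

```javascript
// Hypothetical naive filter: removes <script>...</script> blocks in a single pass.
function naiveStripScripts(html) {
    return html.replace(/<script[^>]*>[\s\S]*?<\/script\s*>/gi, "");
}

// Pitfall 1: removing the inner tags splices the outer fragments
// together into a brand-new <script> element.
const nested = "<scr<script></script>ipt>alert(1)</scr<script></script>ipt>";
console.log(naiveStripScripts(nested));
// prints: <script>alert(1)</script>

// Pitfall 2: there is no <script> tag at all, so the event-handler
// attribute passes through untouched.
const handler = '<img src="x" onerror="alert(1)">';
console.log(naiveStripScripts(handler));
// prints: <img src="x" onerror="alert(1)">
```

A parser-based sanitizer with an allowlist of tags and attributes avoids both cases, which is why sticking with an established library is the safer route.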