How to accurately replace scripts / html before saving data from servlet to database


So this sounds like a very basic question, but I seem to find very little information about it online.

I imagine a simple replace with regex, something like

replaceAll("\\<.*?>", "") 

but is it really sufficient? Isn't there a widely accepted 'best' way to do this?

This is mainly to prevent script injections, so is replacing a script tag really all that is necessary?

Is there any added thing that can be done via Client-side script itself to prevent it before it even reaches the server?

EDIT: To be clear. Users shouldn't be able to enter HTML at all regardless. These are text inputs for names, emails, descriptions, etc.

Best answer, by JustinH

I believe there is little information online about this because it's nuanced. That regex might be fine. One potential flaw: the input was, for whatever reason, escaped before the regex ran and then unescaped afterwards, so the tags slip through. Another potential flaw: the user thinks it's okay to use HTML, and their data is stored incorrectly. You can get pretty convoluted with examples here, but the point is that there is no silver bullet for input sanitization.
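To make the first flaw concrete, here is a minimal sketch. The class name and the `unescape` helper are made up for illustration, and the helper stands in for whatever later step might reverse entity escaping in a real application.

```java
class StripThenUnescape {
    // The regex from the question: removes anything that looks like a tag.
    static String stripTags(String input) {
        return input.replaceAll("<.*?>", "");
    }

    // Hypothetical later step that reverses entity escaping (illustration only).
    static String unescape(String input) {
        return input.replace("&lt;", "<")
                    .replace("&gt;", ">")
                    .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String raw = "<script>alert(1)</script>Bob";
        String escaped = "&lt;script&gt;alert(1)&lt;/script&gt;Bob";

        // Raw input: the regex finds the tags and removes them.
        System.out.println(stripTags(raw));

        // Already-escaped input: the regex sees no '<', so nothing is removed,
        // and the tags reappear once something unescapes the stored value.
        System.out.println(unescape(stripTags(escaped)));
    }
}
```

The regex only protects you if it sees the data in the same form that will eventually be rendered.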

Hands down the best response is in when-is-it-best-to-sanitize-user-input, from 'Your Common Sense' and 'Kibbee'. I believe those answers address the intent of your question, and they do a better job of it than I could. Since they don't directly address your question, I'll add a bit more here.

It's good to have some validation in place on the client side, especially for things like emails (user sanity, really); adding validation for tags is good if the user might, for some reason, expect that HTML is okay. Ultimately though, if you're worried about HTML reaching your server, you cannot guarantee anything on the client side, because a request can be sent to the server without going through your form at all. Thankfully, this is OK!

This is a little aside, but since you specifically mention data going into a db: the biggest sanity saver for that data, as Kibbee also states, is prepared statements, and of course validation of data types. To be clear, dangerous HTML specifically is dangerous when you render it.
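Since client-side checks can be bypassed, the same checks need to be repeated in the servlet. Below is a rough sketch of what that might look like for the plain-text fields mentioned in the question. The class name, method names, and length limit are assumptions, and the email pattern is deliberately loose (it checks the shape, not whether the address exists).

```java
import java.util.regex.Pattern;

class InputChecks {
    // Loose shape check: something@something.something, with no whitespace,
    // extra '@', or angle brackets. An assumption, not a full address validator.
    private static final Pattern EMAIL =
            Pattern.compile("^[^\\s@<>]+@[^\\s@<>]+\\.[^\\s@<>]+$");

    static boolean isPlausibleEmail(String value) {
        return value != null
                && value.length() <= 254
                && EMAIL.matcher(value).matches();
    }

    // Rejects the value instead of silently rewriting it, so the user finds out
    // that HTML is not accepted and nothing is stored in a mangled form.
    static boolean isPlainText(String value, int maxLength) {
        return value != null
                && value.length() <= maxLength
                && value.indexOf('<') < 0
                && value.indexOf('>') < 0;
    }
}
```

Rejecting angle brackets outright also rejects harmless text such as "x < y" in a description, which is the kind of trade-off worth weighing per field.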

If this all feels a bit general, the simple answer is that any sneaky HTML in your data is only dangerous depending on how you use it, and you want to be careful about how future you (or anyone else working on the project) might decide to use it. For example, escaping < and > to &lt; and &gt; makes the browser treat the HTML as plain text, but if you hit your head tomorrow and un-escape all input on the client side, then you may be in trouble. Likewise, if you only strip <...> with a regex but the input was already escaped, you'd miss it, and perhaps you or someone else later decides to un-escape the data for whatever reason. It's possible.

So, if you wanted to be uber-duper super safe at all costs, you could remove every < and >, any possible equivalent, every possible HTML tag ever, and all the potential code inside; even better, just remove all input (OK, I'm taking it way too far). I don't recommend this in almost any case, though, for the reasons outlined in the linked answers. Weighing potential misuse against data loss is all part of the problem (among the many other considerations in software dev). But if you ever have a specific case you're unsure about (we all do sometimes), I'm sure people will be available to help you out =)
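For the escaping approach, here is a minimal sketch of an escape-on-output helper. The class and method names are made up; in a real project you would more likely rely on your templating layer or an established encoder library than hand-roll this.

```java
class HtmlEscape {
    // Replaces the characters that are significant in HTML text and attribute
    // values with their entity equivalents, so the browser shows them as text.
    static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

The idea is to store the value as the user typed it and call something like this at the point where it is written into a page, so the stored data stays intact and the escaping matches the context it is rendered in.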

Hope this helps!