How to parse out JavaScript from arbitrary HTML

I'm using Jericho to sanitize HTML, and it works great except in one situation I can't figure out. I want to completely remove every script element, including its content. Right now the script tags are removed, but the actual script content is kept.

So currently I create a Source object and do a fullSequentialParse. Then I create an OutputDocument and loop through each tag.

When I get to a "script" tag I just want to replace the whole thing with "".

Any ideas?

TIA

There are 2 best solutions below

meskobalazs On

I am not familiar with Jericho, but it can work on a tree very similar to a DOM tree, so you can remove the whole script element instead of just the tag. (For a very large HTML document this may not be optimal, though.)

If not, then you can go the SAX way: remember the opening script tag, and when you reach the closing tag, remove everything in between, the two tags included.
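
To make the SAX-style idea concrete, here is a minimal sketch in plain Java. It deliberately uses no Jericho calls; the class name `ScriptSkipper` and its helpers are made up for illustration. It assumes the whole document fits in a `String` and that the opening script tag has no `>` inside an attribute value, so treat it as an illustration of the state tracking rather than a full HTML parser.

```java
final class ScriptSkipper {
    /** Returns html with every script element dropped, content included. */
    static String stripScripts(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inScript = false; // true between an opening and a closing script tag
        int i = 0;
        while (i < html.length()) {
            if (!inScript && startsTag(html, i, "<script")) {
                inScript = true;              // remember the opening tag
                i = skipPastTagEnd(html, i);  // drop the tag itself
            } else if (inScript && startsTag(html, i, "</script")) {
                inScript = false;             // closing tag reached
                i = skipPastTagEnd(html, i);
            } else {
                if (!inScript) {
                    out.append(html.charAt(i)); // keep only text outside scripts
                }
                i++;
            }
        }
        return out.toString();
    }

    // True if `tag` occurs at index i (ignoring case) and is followed by
    // '>', '/', whitespace, or end of input, so "<scripted" does not match.
    private static boolean startsTag(String html, int i, String tag) {
        if (!html.regionMatches(true, i, tag, 0, tag.length())) {
            return false;
        }
        int next = i + tag.length();
        if (next >= html.length()) {
            return true;
        }
        char c = html.charAt(next);
        return c == '>' || c == '/' || Character.isWhitespace(c);
    }

    // Index just past the next '>' at or after i, or end of input if none.
    private static int skipPastTagEnd(String html, int i) {
        int gt = html.indexOf('>', i);
        return gt < 0 ? html.length() : gt + 1;
    }
}
```

With a real parser you would flip the same `inScript` flag on the start-tag and end-tag events instead of scanning characters. An unterminated script is treated as running to the end of the input.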

pro_cheats On

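A simple and efficient method:

  1. Traverse the document to reach the script tags one by one.
  2. For each script start tag, find the next script end tag (a for loop works).
  3. Get the positions (integer offsets) of the start tag and the end tag.
  4. Remove everything between those positions, tags included. The positions are character offsets, not lines, so remove the range from the output you build from the source (the OutputDocument you already create).
  5. Replace the source file: write the result to a file with the same name in the same folder, which overwrites the original.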
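
The steps above can be sketched roughly as follows, again in plain Java rather than with Jericho's own classes. `ScriptRangeRemover` and `indexOfTag` are hypothetical names, and the tag search is a simple case-insensitive scan, so this only shows the offset bookkeeping under those assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class ScriptRangeRemover {
    /** Collects the [begin, end) offsets of each script element, then deletes them. */
    static String removeScripts(String html) {
        List<int[]> ranges = new ArrayList<>();
        int from = 0;
        while (true) {
            int begin = indexOfTag(html, "<script", from);    // step 1: next script start tag
            if (begin < 0) {
                break;
            }
            int endTag = indexOfTag(html, "</script", begin); // step 2: its next end tag
            int end = html.length();                          // unterminated: run to end of input
            if (endTag >= 0) {
                int gt = html.indexOf('>', endTag);
                if (gt >= 0) {
                    end = gt + 1;
                }
            }
            ranges.add(new int[] {begin, end});               // step 3: remember the offsets
            from = end;
        }
        // Step 4: delete from the last range to the first so that the
        // earlier offsets are still valid after each deletion.
        StringBuilder out = new StringBuilder(html);
        for (int k = ranges.size() - 1; k >= 0; k--) {
            out.delete(ranges.get(k)[0], ranges.get(k)[1]);
        }
        return out.toString();
    }

    // Case-insensitive search for `tag` followed by '>', '/', whitespace,
    // or end of input; returns -1 when there is no such occurrence.
    private static int indexOfTag(String s, String tag, int from) {
        for (int i = from; i + tag.length() <= s.length(); i++) {
            if (!s.regionMatches(true, i, tag, 0, tag.length())) {
                continue;
            }
            int next = i + tag.length();
            if (next == s.length()) {
                return i;
            }
            char c = s.charAt(next);
            if (c == '>' || c == '/' || Character.isWhitespace(c)) {
                return i;
            }
        }
        return -1;
    }
}
```

For step 5, the returned string can be written back over the original file, for example with `java.nio.file.Files.writeString` on Java 11 or later.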

A2A :)