How to parse HTML and get the result as a String using Java


I want to parse an HTML document and get the result as a string. The body of the outer HTML contains another HTML document as escaped text, and I want that inner HTML as the output string.

Example input HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><head></head><body><p>&lt;!DOCTYPE html&gt;<br />&lt;html&gt;<br />&lt;body&gt;<br /><br />&lt;h1&gt;My First Heading&lt;/h1&gt;<br /><br />&lt;p&gt;My first paragraph.&lt;/p&gt;<br /><br />&lt;/body&gt;<br />&lt;/html&gt;<br /><br /></p></body></html>

Expected output string:

<!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>

Important: I am using an HTML editor that, when I type something into it, returns the HTML representation of that input from getText. The first HTML string above is that representation.

Also, the output string should match what is displayed when I run the first string here (http://www.w3schools.com/html/tryit.asp?filename=tryhtml_basic).

Please help me with this.

Vyncent

I would go with a regular expression:

(<!DOCTYPE html>).*(<html>.*</html>).+

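Then take group 1 and group 2, after first decoding the escaped angle brackets:

    // needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
    // Decode the escaped angle brackets so the inner document becomes real markup.
    tst = tst.replace("&lt;", "<").replace("&gt;", ">");
    // Group 1: the inner doctype; group 2: the inner <html>...</html> (the outer </html> must follow it).
    Pattern p = Pattern.compile("(<!DOCTYPE html>).*(<html>.*</html>).*</html>.*");
    Matcher m = p.matcher(tst);
    if (m.find()) {
        System.out.println(m.group(1) + m.group(2));
    }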

Running example: http://rextester.com/JTOJ89529
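
A possible alternative, as a minimal sketch rather than a definitive implementation: treat the outer body as text, strip the real tags the editor added (`<p>`, `<br />`), and only then decode the entities, so the editor's `<br />` separators do not end up in the output. The class name `InnerHtmlExtractor` and the method `extractInnerHtml` are hypothetical names chosen for illustration. It assumes the outer document has a single real `<body>` element and that only the basic entities shown below need decoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InnerHtmlExtractor {

    // Matches the content of the outer <body> element (DOTALL so line breaks are allowed).
    private static final Pattern BODY =
            Pattern.compile("<body[^>]*>(.*)</body>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the escaped inner document as plain markup.
    static String extractInnerHtml(String outerHtml) {
        // 1. Keep only what is inside the outer <body>; fall back to the whole input.
        Matcher m = BODY.matcher(outerHtml);
        String body = m.find() ? m.group(1) : outerHtml;

        // 2. Remove real tags such as <p> and <br />. The inner document survives
        //    because its angle brackets are still escaped as &lt; and &gt;.
        String text = body.replaceAll("<[^>]+>", "");

        // 3. Decode the basic entities; &amp; goes last to avoid double decoding.
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&")
                   .trim();
    }

    public static void main(String[] args) {
        String input = "<html><head></head><body><p>&lt;!DOCTYPE html&gt;<br />"
                + "&lt;html&gt;<br />&lt;body&gt;<br /><br />"
                + "&lt;h1&gt;My First Heading&lt;/h1&gt;<br /><br />"
                + "&lt;p&gt;My first paragraph.&lt;/p&gt;<br /><br />"
                + "&lt;/body&gt;<br />&lt;/html&gt;<br /><br /></p></body></html>";
        // prints <!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>
        System.out.println(extractInnerHtml(input));
    }
}
```

For anything beyond this simple shape (more entities, nested real markup inside the body), a proper HTML parser is likely a safer choice than regular expressions.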