Get output of web page in C#


I am trying to get the rendered content of a web page so I can extract the displayed text. I tried the code below, but it returns the source HTML rather than the rendered HTML.

string urlPath = "http://www.cbsnews.com/news/jamar-clark-protests-follow-decision-not-to-file-charges-in-minneapolis-police-shooting/";
WebClient client = new WebClient();
string str = client.DownloadString(urlPath);

If you compare the text in the str variable with the HTML shown in Chrome's Developer Tools, you will see that the two differ.

Any recommendations will be appreciated.


There are 2 best solutions below

1
squillman On

I'm assuming you mean that you want the article text. If so, you will need a different approach. The page you refer to relies on client-side script that injects much of its content into the base HTML document when it runs, and DownloadString only fetches that base document without executing any script. You will need to parse the DOM after the script has executed to get the content you're interested in.

0
saucecontrol On

As others have pointed out, an actual web browser parses the downloaded HTML and executes JavaScript against it, potentially altering its content. While you could try to do that parsing yourself, the easiest route is to ask a real web browser to do it for you and then grab the results.

The easiest solution specifically in C# is the WebBrowser control from Windows Forms, which essentially exposes Internet Explorer to your program and lets you control it. Use the Navigate method to load the URL in question, wait for the DocumentCompleted event, then use the Document property to walk the DOM. At that point you can read an element's OuterHtml property to get the final content of the DOM as HTML.
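Here is a minimal sketch of that approach, not a definitive implementation. The class name RenderedHtml and the method name Get are made up for illustration. It assumes a project that references System.Windows.Forms, and it runs the control on its own STA thread with a message loop, which the WebBrowser control requires.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

// Hypothetical helper: the class and method names are illustrative only.
static class RenderedHtml
{
    // Loads the URL in an off-screen WebBrowser control and returns the
    // DOM serialized as HTML once the document has finished loading.
    public static string Get(string url)
    {
        string html = null;

        var thread = new Thread(() =>
        {
            using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
            {
                browser.DocumentCompleted += (sender, e) =>
                {
                    // DocumentCompleted can fire once per frame, so wait
                    // until the control reports the whole page as complete.
                    if (html != null || browser.ReadyState != WebBrowserReadyState.Complete)
                        return;

                    // OuterHtml of the root element reflects the DOM after
                    // the scripts that have run so far, not the raw source.
                    html = browser.Document.GetElementsByTagName("html")[0].OuterHtml;
                    Application.ExitThread(); // stop the message loop below
                };

                browser.Navigate(url);
                Application.Run(); // pump messages until ExitThread is called
            }
        });

        // The WebBrowser control is COM-based and must live on an STA thread.
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();
        return html;
    }
}
```

Calling RenderedHtml.Get(urlPath) would then take the place of client.DownloadString(urlPath) in the question. One caveat: content that scripts inject asynchronously after DocumentCompleted fires will not be in the result, so a page like this one may need an extra wait or a check for a specific element before the HTML is read.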

If you're not writing a Windows program and are more interested in headless operation, have a look at PhantomJS. It's a headless WebKit browser that is scriptable from JavaScript and would give you similar capability, although not in C#.