I want to extract only specific inner text from the body of a particular HTML page. By specific, I mean only the text that is related to the title of the page.
The problem is that I am scraping data from a public website, and it's full of advertisements, signup forms, etc., which are irrelevant to the title of the page.
While searching online I found a library called Simple HTML DOM, so I added it to my project. This is how I am fetching the body text from the website:
    include('simple_html_dom.php');
    $html = file_get_html("URL of the website here");
    if ($html) {
        if ($html->find('p')) {
            foreach ($html->find('p') as $element) {
                echo $element->plaintext . '<br>';
            }
        }
    }
As I feared, it also fetches all the unnecessary text that sits inside <p> tags. So my question is: how do I separate the unnecessary text (from ads, etc.) from the main body text? Or is there another library that would help me do this? Please guide me.
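To make the question concrete, here is a minimal sketch of the kind of filtering I have in mind. It assumes the main article is the container whose direct <p> children hold the most text, which is only a heuristic and will not hold for every page. It uses PHP's built-in DOMDocument rather than Simple HTML DOM, and the function name extractMainParagraphs and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: returns the text of the <p> elements that belong to
// the parent element containing the most paragraph text overall.
function extractMainParagraphs(string $htmlSource): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally; real-world markup is often malformed.
    libxml_use_internal_errors(true);
    $doc->loadHTML($htmlSource);
    libxml_clear_errors();

    $totals = [];      // parent node path => total characters of paragraph text
    $paragraphs = [];  // parent node path => list of paragraph strings

    foreach ($doc->getElementsByTagName('p') as $p) {
        $text = trim($p->textContent);
        if ($text === '') {
            continue; // skip empty paragraphs
        }
        // Group paragraphs by the XPath-style location of their parent.
        $key = $p->parentNode->getNodePath();
        $totals[$key] = ($totals[$key] ?? 0) + strlen($text);
        $paragraphs[$key][] = $text;
    }

    if ($totals === []) {
        return [];
    }

    // Sort parents by total text length, largest first, and keep the winner.
    arsort($totals);
    return $paragraphs[array_key_first($totals)];
}

// Made-up sample page: an ad, an article, and a signup box.
$sample = '<html><body>
<div class="ad"><p>Buy now!</p></div>
<div class="article"><p>First long paragraph of the actual article text.</p><p>Second paragraph that continues the article.</p></div>
<div class="signup"><p>Sign up here.</p></div>
</body></html>';

foreach (extractMainParagraphs($sample) as $line) {
    echo $line, "\n";
}
```

On the sample markup this keeps only the two paragraphs inside the "article" div. On a real page the ads may well contain more text than the article, so I am not sure this approach is reliable enough on its own.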
EDIT: At the moment I am trying to fetch the necessary body text (excluding ads, etc.) from this URL: