I am trying to scrape content (for example, a course's thumbnail picture, price, etc.) from an educational website, Udemy, by requesting a general search URL (given below). The page source has a div with the class name "ud-app-loader ud-component--search--search". A snippet of it is provided below.
Website to scrape data from (the search query is "Selenium"): Udemy search for available Selenium courses
<div class="ud-main-content">
<div class="ud-app-loader ud-component--search--search" data-module-id="search" data-module-args="{"subsCollectionIds":null,"showSRPRefundNotice":false,"showUserEnrollmentProgress":false,"showCodingExerciseCount":false,"enableLabsInPersonalPlan":false,"enableLectureBottomDrawerOnSRP":false,"showCodingExercisesBadge":false,"enableLectureDiscoveryUnitInUb":false,"disableRelatedTopicsOnSRP":false,"enableActiveLearningElementIcons":false}"></div>
</div>
But the same <div class="ud-app-loader ud-component--search--search"> looks different in the Inspect window. It has multiple child divs beneath it (for each course there is an associated <div class="popper-module--popper--2BpLn">).
I am not yet very familiar with front-end technologies, but I am assuming (after reading a similar question on Stack Overflow about scraping "data-module-group" with BeautifulSoup) that the data is fetched via AJAX calls. However, I cannot find any AJAX URLs in the page.
Similar question: Extract details from <div data-module-group=> using BeautifulSoup
Initially I planned to use Jsoup for scraping the content, but later found that Jsoup cannot handle such asynchronous calls; it is just an HTML parser. So I am using HtmlUnit now.
My code won't help much here, but I am adding it for reference.
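// HtmlUnit 2.x package names; HtmlUnit 3.x moved these classes to org.htmlunit.*
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.net.URLEncoder;

public class Scraper {
    public static void getData(String courseName, String sortType) throws Exception {
        // URL-encode the caller-supplied values so spaces and symbols don't break the query string.
        String url = "https://www.udemy.com/courses/search/?lang=en&price=price-paid&q="
                + URLEncoder.encode(courseName, "UTF-8")
                + "&ratings=4.5&sort=relevance&sort=" + URLEncoder.encode(sortType, "UTF-8") + "&src=ukw";
        // WebClient is AutoCloseable; try-with-resources releases its windows and threads.
        try (WebClient client = new WebClient(BrowserVersion.FIREFOX)) {
            client.getOptions().setJavaScriptEnabled(true);
            client.getOptions().setCssEnabled(true);
            client.getOptions().setThrowExceptionOnScriptError(false);
            client.setAjaxController(new NicelyResynchronizingAjaxController());
            HtmlPage page = client.getPage(url);
            // Give background scripts (including AJAX calls) up to 50 seconds to finish.
            client.waitForBackgroundJavaScript(50000);
            System.out.println(page.asXml());
        }
    }
}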
In the code above I print the whole page without filtering for the element with the built-in methods, as I can do that later. My priority is to get the desired HTML page first.
My questions are:
- If it is an AJAX call to an AJAX source URL, where and how can I find that URL in the page? What should the next steps be to get the child divs of <div class="ud-app-loader">?
- If it is not AJAX, then what is it, and how can I proceed to extract the data from this data module? If the answer is some tool other than HtmlUnit, that is fine too.
I would be glad if anyone could help or point me toward a solution.
-Abhay.
It sounds like you're trying to scrape Udemy's search results page for Selenium courses with HtmlUnit, but the course data is missing from the HTML you get back. The page source contains only an empty placeholder div, while the Inspect window shows the DOM after JavaScript has run, so the data is most likely loaded dynamically through AJAX calls. Those request URLs usually don't appear in the page markup; you find them by watching the network traffic.
Here's what you can try: open your browser's developer tools, switch to the Network tab (filtering by XHR/Fetch helps), and perform a search on Udemy. Look for requests whose responses contain the course data you need, then inspect the request URL to see which query parameters you can change to get different or additional results.
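Once you start changing parameters, it helps to build the query string in code rather than by string concatenation. Here is a minimal sketch using only the Java standard library (Java 10+). The base URL and parameter names are the ones from your question; the class name SearchUrls and the method build are made up for illustration, and the URL you actually find in the Network tab may look different.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: assembles a search URL from a map of query parameters.
final class SearchUrls {
    // Base URL taken from the question; swap in the URL you see in the Network tab.
    static final String BASE = "https://www.udemy.com/courses/search/";

    static String build(String base, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            // Encode both key and value so spaces and symbols are safe in the URL.
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return base + "?" + query;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "selenium webdriver");
        params.put("ratings", "4.5");
        params.put("sort", "relevance");
        System.out.println(build(BASE, params));
    }
}
```

With a map like this you can vary one parameter at a time (for example the sort order) and compare the responses.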
Once you've found the AJAX source URL, you can simulate the AJAX request yourself, using HtmlUnit or any other HTTP client, and extract the information you want from the response.
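To give a rough idea of what replaying such a request looks like, here is a sketch that uses the JDK's built-in java.net.http client (Java 11+) instead of HtmlUnit, simply because it needs no extra dependency. The class name XhrReplay is made up, and the endpoint is whatever URL you copied from the Network tab; I don't know Udemy's actual endpoint, and it may well require cookies or extra headers that this sketch does not send.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper: replays a GET request found in the browser's Network tab.
final class XhrReplay {
    // Builds the request without sending it. The Accept header is an assumption:
    // many AJAX endpoints return JSON, but check what the browser actually sent.
    static HttpRequest buildRequest(String endpoint) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .timeout(Duration.ofSeconds(20))
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    // Sends the request and returns the raw response body as a string.
    static String fetch(String endpoint) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(endpoint), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Pass the copied request URL as the only command-line argument.
        if (args.length == 1) {
            System.out.println(fetch(args[0]));
        }
    }
}
```

If the body comes back as JSON, you can hand it to a JSON parser and pick out the course fields, with no HTML parsing needed.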
If you're still stuck, the data may be loaded by a different mechanism. In that case, you might need another tool or technique, for example a real browser driven by an automation tool such as Selenium WebDriver, which runs the page's JavaScript the same way your own browser does.