I need to crawl a particular website to dig out some relevant information. It looks like I first have to search the site to get the corresponding URLs, which, when crawled, will give me the detailed information.
Let's assume the search URL is:
example.com/city1/search.html?cat=category1&locality=location1&page=1
This means there can be city2, city3, etc.; the category can be category2, category3, etc.; and the same goes for locality and page.
I have collected all the cities, categories and localities, and the page number can be incremented until the result comes back empty.
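To make the enumeration concrete, here is a minimal sketch of generating every search URL from those lists. The city/category/locality values below are placeholders standing in for the real collected data, and for simplicity a fixed page count is used instead of stopping at the first empty result:

```javascript
// Build every combination of city x category x locality x page.
// In a real crawl you would keep incrementing `page` until a
// request returns no results, rather than using a fixed maxPage.
function buildSearchUrls(cities, categories, localities, maxPage) {
  var urls = [];
  cities.forEach(function (city) {
    categories.forEach(function (cat) {
      localities.forEach(function (loc) {
        for (var page = 1; page <= maxPage; page++) {
          urls.push('http://example.com/' + city + '/search.html' +
                    '?cat=' + cat + '&locality=' + loc + '&page=' + page);
        }
      });
    });
  });
  return urls;
}

var urls = buildSearchUrls(['city1', 'city2'], ['category1'], ['location1'], 2);
console.log(urls.length); // 2 cities x 1 category x 1 locality x 2 pages = 4
```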
After getting all the URLs, I'll have to dig out the detailed information from each one. I have seen that some of the necessary information is only available as part of the page's JavaScript.
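One thing worth checking before reaching for a headless browser: if the detail you need sits in the page source as a JavaScript *literal* (rather than being fetched later via AJAX), you can often pull it out of the raw HTML with a regex, no script execution needed. A sketch, where the `var detail = {...}` assignment and its fields are hypothetical stand-ins for whatever the real page embeds:

```javascript
// Hypothetical raw HTML containing the data inside an inline <script>.
var html =
  '<html><script>var detail = {"phone": "555-0100", "address": "1 Main St"};' +
  '</script></html>';

// Capture the object literal (works here because it has no nested braces)
// and parse it as JSON.
var match = html.match(/var detail = (\{[^}]*\})/);
var detail = match ? JSON.parse(match[1]) : null;
console.log(detail.phone);
```

If the data is instead built up at runtime (AJAX calls, DOM manipulation), this trick won't work and you need something that actually executes the page's scripts.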
Now, I have looked at node.io, jsdom and PhantomJS. I have also seen YQL. Since I am new to this, please suggest from your experience which one is ideal in this scenario.
If you can cite some examples, that would be awesome.
PhantomJS will run the JavaScript in the URL you give it, which is very useful when the page's content is generated by JavaScript/AJAX. YQL doesn't run the site's JavaScript/AJAX, but it's fast to get something up and running.
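A minimal sketch of what a PhantomJS script for this looks like (save it as, say, fetch.js and run `phantomjs fetch.js`). The search URL and the `.result` selector are assumptions about the target page; `page.evaluate` runs its function *inside* the loaded page, after the page's own JavaScript has executed:

```javascript
// This function is handed to page.evaluate, so it executes in the
// context of the loaded page and can see the rendered DOM.
function extractText(selector) {
  var nodes = document.querySelectorAll(selector);
  var out = [];
  for (var i = 0; i < nodes.length; i++) out.push(nodes[i].textContent);
  return out;
}

if (typeof phantom !== 'undefined') { // only runs under PhantomJS, not plain node
  var page = require('webpage').create();
  page.open('http://example.com/city1/search.html?cat=category1&locality=location1&page=1',
    function (status) {
      if (status !== 'success') { phantom.exit(1); }
      // Extra arguments after the function are passed into the page context.
      var texts = page.evaluate(extractText, '.result');
      console.log(JSON.stringify(texts));
      phantom.exit();
    });
}
```

The trade-off: PhantomJS spins up a full WebKit instance per page, so it is much slower than plain HTTP fetching with jsdom or node.io; use it only for the pages where the data really is assembled by script.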