I want to do web scraping in Java, and Apache Nutch came up as the first choice. I have to scrape dynamic elements from a website, such as the price and mileage of a vehicle. I have done the setup and run Nutch with the seed.txt URL https://www.andersondouglas.com. But all I can see in crawl/segments is a file that contains just the URL. I can't find the HTML content of the crawled web page. Can someone please help? How can I get the HTML content?
Apache Nutch version: 1.19
Here are the steps to fetch a URL and to export the HTML of the fetched page:
(nutch stands for ...nutch_install_path/bin/nutch.)

1. echo https://nutch.apache.org/ >seeds.txt
2. nutch inject crawldb seeds.txt
3. nutch generate crawldb/ segments/
4. nutch fetch segments/20230310113604/ (the segment name is a time stamp, it needs to be adapted)
5. nutch parse segments/20230310113604/ (only required if metadata, outlinks or plain text are required)
6. nutch readseg -dump segments/20230310113604/ segdump
7. The HTML is now found in segdump/dump.

Run nutch readseg to get the help for more command-line options.
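
Because the segment name changes on every run, the steps above can be wrapped in a small script that looks up the newest segment instead of hard-coding the time stamp. This is only a rough sketch, not something shipped with Nutch: the function names `latest_segment` and `fetch_and_dump` and the `NUTCH` variable are made up for illustration, and it assumes the `crawldb` and `segments` directories live in the current working directory. The last step uses the segment reader's dump mode (`readseg -dump`).

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the newest segment directory under $1.
# Segment names are time stamps (e.g. 20230310113604), so sorting the
# names alphabetically also sorts them chronologically.
latest_segment() {
  local dir
  dir=$(find "$1" -mindepth 1 -maxdepth 1 -type d | sort | tail -n 1)
  [ -n "$dir" ] || return 1   # no segment found
  printf '%s\n' "$dir"
}

# Hypothetical wrapper around the manual steps.
#   $1 = seed file, $2 = output directory for the dump
# NUTCH is an assumed variable pointing at ...nutch_install_path/bin/nutch;
# it falls back to "nutch" on the PATH.
fetch_and_dump() {
  local nutch="${NUTCH:-nutch}" seeds="$1" out="$2" segment
  "$nutch" inject crawldb "$seeds" || return 1
  "$nutch" generate crawldb/ segments/ || return 1
  segment=$(latest_segment segments) || return 1
  "$nutch" fetch "$segment" || return 1
  # Parsing is only needed for metadata, outlinks or plain text.
  "$nutch" parse "$segment" || return 1
  # Writes the dump, including the raw HTML, to "$out/dump".
  "$nutch" readseg -dump "$segment" "$out"
}
```

After sourcing the script, a call such as `fetch_and_dump seeds.txt segdump` would run the whole sequence and leave the result in segdump/dump.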