Using Java & Apache Nutch to scrape dynamic elements from a website

Asked by Prachi Sharma on 09 March 2023 at 09:01

I want to do web scraping in Java, and Apache Nutch seems to be the first choice. I have to scrape dynamic elements from a website, such as the price and mileage of a vehicle. I have done the setup and tried to run Nutch with the URL https://www.andersondouglas.com in seed.txt. But all I can see in crawl/segments is a file that contains just the URL. I can't find the HTML content of the crawled webpage. Can someone please help? How can I scrape the HTML content?

apache-nutch version 1.19


There are 2 best solutions below

Sebastian Nagel On 10 March 2023 at 10:58 BEST ANSWER

Here are the steps to fetch a URL and to export the HTML of the fetched page:

- Install Nutch and configure the agent name as described in the Nutch tutorial. Except for the agent name, all other configuration settings are the default ones. The next steps are run in an empty directory. The command nutch stands for ...nutch_install_path/bin/nutch.
- Place the URL into the seed file: echo https://nutch.apache.org/ >seeds.txt
- Inject the seed into the CrawlDb: nutch inject crawldb seeds.txt
- Generate a segment: nutch generate crawldb/ segments/
- Fetch the generated segment: nutch fetch segments/20230310113604/ (the segment name is a timestamp and needs to be adapted)
- (Optionally) parse the segment: nutch parse segments/20230310113604/ (only needed if metadata, outlinks or plain text are required)

Get the record of the URL (it includes the HTML but also more information):

```
$> nutch readseg -get segments/20230310113604/ https://nutch.apache.org/
...
Content:
<!DOCTYPE html>
<html lang="en-us">

<head>
  <meta name="generator" content="Hugo 0.92.2" />
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title> Apache Nutch™ </title>
  ...
```

(alternatively) dump the segment:
```
nutch readseg -dump segments/20230310113604/ segdump -recode
```
- the HTML text is written to segdump/dump
- it's recoded to UTF-8
- run nutch readseg to get the help for more command-line options
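
The steps above could be chained into a small shell function, roughly as sketched here. This assumes nutch (that is, bin/nutch) is on the PATH and the agent name is already configured. The helper names latest_segment and crawl_once are made up for illustration and are not part of Nutch. Because segment names are timestamps, the newest segment can be found by sorting the directory names instead of typing the name by hand.

```shell
#!/bin/sh
# Sketch only: one inject/generate/fetch/parse round for a single URL.
# Assumes `nutch` (bin/nutch) is on the PATH and the agent name is configured.
# latest_segment and crawl_once are hypothetical helper names.

# Print the newest segment directory below $1. Segment names are
# timestamps, so sorting the names also sorts them by time.
latest_segment() {
  ls -d "$1"/*/ | sort | tail -n 1
}

# Run one crawl round for the URL in $1 and print its stored record.
crawl_once() (
  set -e                                   # stop at the first failing step
  url="$1"
  echo "$url" > seeds.txt                  # place the URL into the seed file
  nutch inject crawldb seeds.txt           # inject the seed into the CrawlDb
  nutch generate crawldb/ segments/        # generate a segment
  segment="$(latest_segment segments)"     # e.g. segments/20230310113604/
  nutch fetch "$segment"                   # fetch the generated segment
  nutch parse "$segment"                   # optional parse step
  nutch readseg -get "$segment" "$url"     # record of the URL, including the HTML
)

# Example call, run in an empty directory:
# crawl_once https://nutch.apache.org/
```

The function body runs in a subshell, so the set -e and the variables do not leak into the calling shell.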

Sebastian Nagel On 09 March 2023 at 16:01

The raw content of a page (HTML, but it could also be a binary format such as PDF) is stored in the segments in the subfolder "content". Note that the content is only stored

- if the property fetcher.store.content is true (this is the default) and
- if fetching was successful.

An attempt to fetch the given URL resulted in an HTTP 403 Forbidden. Very likely the site is protected.
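
To see whether a fetch actually stored anything, one option is to print the segment summary and check for the "content" subfolder. The following is only a sketch: has_content and check_fetch are hypothetical helper names, and it assumes nutch is on the PATH and that the segment path is passed in by hand.

```shell
# Sketch only; has_content and check_fetch are hypothetical helper names.

# Succeeds if the segment directory in $1 has a "content" subfolder.
has_content() {
  [ -d "$1/content" ]
}

# Print the summary of the segment in $1 (readseg -list), then report
# whether raw content was stored for it.
check_fetch() {
  nutch readseg -list "$1"
  if has_content "$1"; then
    echo "content stored in $1"
  else
    echo "no content stored in $1"
  fi
}

# Example call (the segment name is a timestamp and needs to be adapted):
# check_fetch segments/20230310113604/
```

If the summary shows that nothing was fetched, the problem is on the fetch side (for example a 403 response) and not with the fetcher.store.content setting.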
