I am doing data scraping for the first time. My assignment is to get specific URL from webpage where there are multiple links (help, click here etc). How can I get specific url and ignore random links? In this link I only want to get The SEC adopted changes to the exempt offering framework and ignore other links. How do I do that in Java? I was able to extract all URL but not sure how to get specific one. Below is my code
while (rs.next()) {
String Content = rs.getString("Content");
doc = Jsoup.parse(Content);
//email extract
Pattern p = Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+");
Matcher matcher = p.matcher(doc.text());
Set<String> emails = new HashSet<String>();
while (matcher.find()) {
emails.add(matcher.group());
}
System.out.println(emails);
//title extract
String title = doc.title();
System.out.println("Title: " + title);
}
Elements links = doc.select("a");
for(Element link: links) {
String url = link.attr("href");
System.out.println("\nlink :"+ url);
System.out.println("text: " + link.text());
}
System.out.println("Getting all the images");
Elements image = doc.getElementsByTag("img");
for(Element src:image) {
System.out.println("src "+ src.attr("abs:src"));
}