How do I get a list of all existing URLs of a website matching a certain pattern?


I am trying to analyze all existing URLs of a website that share a certain path. As an example, the URL pattern is as follows:

https://www.example.com/users/john

and I am trying to get a list of existing URLs starting with "https://www.example.com/users/".

So the desired output would be something like this:

https://www.example.com/users/john
https://www.example.com/users/alice
https://www.example.com/users/bob
https://www.example.com/users/jeff
https://www.example.com/users/sarah
...

There's no sitemap. How do I get such a list?

There is 1 answer below.

Answer from Mahendhar:

To generate a list of existing URLs following a specific pattern without a sitemap, you can use web scraping techniques. Here's a general approach using Python with the BeautifulSoup library:

1. Send HTTP requests to the website and retrieve its HTML content.
2. Parse the HTML content to extract URLs matching the desired pattern.
3. Store the extracted URLs in a list and crawl them in turn.

Here's a sample Python script demonstrating this approach:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re

base_url = "https://www.example.com/users/"
# Escape the dots so they are matched literally.
pattern = re.compile(r'^https://www\.example\.com/users/.*$')

def extract_urls(url):
    """Fetch a page and return all links on it that match the pattern."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # Resolve relative hrefs against the current page before matching.
        links = (urljoin(url, a['href']) for a in soup.find_all('a', href=True))
        return [link for link in links if pattern.match(link)]
    except Exception as e:
        print(f"Error fetching URL {url}: {e}")
        return []

def get_all_urls(base_url):
    """Breadth-first crawl from base_url, collecting every matching URL."""
    all_urls = [base_url]
    seen = {base_url}   # set for O(1) membership checks
    queue = [base_url]

    while queue:
        current_url = queue.pop(0)
        for url in extract_urls(current_url):
            if url not in seen:
                seen.add(url)
                all_urls.append(url)
                queue.append(url)

    return all_urls

if __name__ == "__main__":
    all_urls = get_all_urls(base_url)
    for url in all_urls:
        print(url)

Replace "https://www.example.com/users/" with the actual base URL of the website you want to scrape. This script will recursively crawl through the website starting from the base URL and extract all URLs matching the specified pattern. It will then print out the list of URLs found.