How do I get a list of all existing URLs of a website matching a certain pattern?


I am trying to analyze all existing URLs of a website that share a certain path. As an example, the URL pattern is as follows:

https://www.example.com/users/john

and I am trying to get a list of existing URLs starting with "https://www.example.com/users/".

So the desired output would be something like this:

https://www.example.com/users/john
https://www.example.com/users/alice
https://www.example.com/users/bob
https://www.example.com/users/jeff
https://www.example.com/users/sarah
...

There's no sitemap. How do I get such a list?

There is 1 answer below.

Answer from Mahendhar:

To generate a list of existing URLs following a specific pattern without a sitemap, you can use web scraping techniques. Here's a general approach using Python with the BeautifulSoup library:

1. Send HTTP requests to the website and retrieve its HTML content.
2. Parse the HTML content to extract URLs matching the desired pattern.
3. Store the extracted URLs in a list and crawl them in turn.

Here's a sample Python script demonstrating this approach:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re

base_url = "https://www.example.com/users/"
# Escape the dots so they are matched literally.
pattern = re.compile(r'^https://www\.example\.com/users/.*$')

def extract_urls(url):
    """Fetch a page and return all links on it that match the pattern."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # Resolve relative hrefs against the current page before matching.
        links = (urljoin(url, a['href']) for a in soup.find_all('a', href=True))
        return [link for link in links if pattern.match(link)]
    except Exception as e:
        print(f"Error fetching URL {url}: {e}")
        return []

def get_all_urls(base_url):
    """Breadth-first crawl from base_url, collecting every matching URL."""
    all_urls = [base_url]
    seen = {base_url}   # set for O(1) membership checks
    queue = [base_url]

    while queue:
        current_url = queue.pop(0)
        for url in extract_urls(current_url):
            if url not in seen:
                seen.add(url)
                all_urls.append(url)
                queue.append(url)

    return all_urls

if __name__ == "__main__":
    all_urls = get_all_urls(base_url)
    for url in all_urls:
        print(url)

Replace "https://www.example.com/users/" with the actual base URL of the website you want to scrape. This script will recursively crawl through the website starting from the base URL and extract all URLs matching the specified pattern. It will then print out the list of URLs found.