Getting text from a web page in Python


I'm trying to get urls from a web page.

I've tried wget, urllib, and lynx (which returned the most organized results). The tricky part is that the URLs appear on the web page as plain text, and long ones are truncated with three dots (e.g. examppppppppppppple.com is shown as examppp...). To view the full URL you have to click the entry's id, which opens a new window where the URL is written out in full, again as plain text. I managed to get the URLs, but I don't know how to enter that other page and get the full text of a dotted URL, and I'm not sure wget -r will work in my case, since the URL is only text, not a link.

This is what I wrote:

import os

def get_urls():
    os.system(
        "lynx -dump https://www.example.com/"
        " | grep -v https://www.example.com/* | grep http"
        " | cut -f5- -d' ' > urls.txt"
    )
  • In the grep -v https://www.example.com/* part I'm excluding the website's own links, because I only want the entries on the website. I've also tried -listonly, but that would only list the page's own URLs.
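Since the visible text is cut off with three dots, one approach is to detect truncated entries first, and only then follow each entry's detail page. The detail-page URL pattern is site-specific (and not given in the question), so the fetch is only indicated in a comment; the detection step itself can be sketched like this:

```python
def is_truncated(url_text):
    """True if the displayed URL text was cut off with three dots."""
    return url_text.rstrip().endswith("...")

entries = [
    "https://example1.com",
    "http://www.another-example...",
]
for text in entries:
    if is_truncated(text):
        # The full URL is only on the entry's detail page; fetching it
        # requires the site's (unknown) detail-page URL pattern, e.g.:
        # full_page = urllib.request.urlopen(detail_url).read()
        print(text, "-> needs detail page")
    else:
        print(text, "-> complete")
```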

Output:

http://www.another-example... 
https://example1.com
https://www.example.com

There is 1 solution below.

Answer by DC Slagel:

Updates for mid-2020

If I understand the task correctly, it is to get a list of the urls embedded in a web page that are not the same base url as the web page itself. So if the page is https://example.com, then list all non-'example.com/..' urls.
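The filtering step can be sketched on its own: given the lines of a lynx -listonly dump, keep only the URLs whose host does not belong to the site itself. This is a minimal sketch; the numbered-line format ("1. https://...") matches what lynx -listonly emits in its References section.

```python
from urllib.parse import urlparse

def external_urls(dump_lines, site):
    """Return URLs from a lynx -listonly dump that do not belong to `site`."""
    urls = []
    for line in dump_lines:
        parts = line.strip().split()
        # lynx -listonly lines look like: "1. https://example.com/page"
        if len(parts) == 2 and "://" in parts[1]:
            url = parts[1]
            if site not in urlparse(url).netloc:
                urls.append(url)
    return urls

dump = [
    "References",
    "",
    "   1. https://stackoverflow.com/questions",
    "   2. https://stackexchange.com/sites",
    "   3. https://stackoverflow.blog/",
]
print(external_urls(dump, "stackoverflow.com"))
```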

Using external Lynx program

Calling Lynx, Python 3.7 and above
# Requires Python 3.7+ (capture_output; the encoding argument needs 3.6+)
import subprocess

site = "stackoverflow.com"
siteurl = "https://" + site

# set encoding so that result is in strings rather than bytes
# set timeout for the case of a non-existent url
try:
    result = subprocess.run(
        ["lynx", "-listonly", "-dump", siteurl],
        capture_output=True,
        encoding='utf-8',
        timeout=3,
    )
    result.check_returncode()
except subprocess.TimeoutExpired as err:
    print("[Error] ", err)
    exit(err.timeout)
except subprocess.CalledProcessError as err:
    print("[Error] ", err.stderr)
    exit(err.returncode)
except Exception as err:
    print("[Error] ", err)
    exit(1)

resultlist = result.stdout.splitlines()

for item in resultlist:
    item = item.strip()

    urlindicator = "://"
    if item.find(urlindicator) > 0:
        # example split line: ["1.", "https://example.com"]
        item_url = item.split()[1] 
        if item_url.find(site) == -1:
            print(item_url)
Calling Lynx, Pre-Python 3.5
# Pre-Python 3.5: check_output with universal_newlines (the encoding argument needs 3.6+)
import subprocess

site = "stackoverflow.com"
siteurl = "https://" + site

# universal_newlines=True so that result is a string rather than bytes
# set timeout for the case of a non-existent url
try:
    result = subprocess.check_output(
        ["lynx", "-listonly", "-dump", siteurl],
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=2,
    )
except subprocess.TimeoutExpired as err:
    print("[Error] ", err)
    exit(err.timeout)
except subprocess.CalledProcessError as err:
    print("[Error] ", err.stderr)
    exit(err.returncode)
except Exception as err:
    print("[Error] ", err)
    exit(1)

resultlist = result.splitlines()

for item in resultlist:
    item = item.strip()

    urlindicator = "://"
    if item.find(urlindicator) > 0:
        # example split line: ["1.", "https://example.com"]
        item_url = item.split()[1] 
        if item_url.find(site) == -1:
            print(item_url)

The output for these examples should look like this (head of the list):

https://stackexchange.com/sites
https://stackoverflow.blog/
https://www.g2.com/products/stack-overflow-for-teams/
https://www.g2.com/products/stack-overflow-for-teams/
https://www.fastcompany.com/most-innovative-companies/2019/sectors/enterprise
https://stackoverflowbusiness.com/
...
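Since the question already mentions urllib, the same listing can also be done without Lynx, using only the standard library: html.parser collects every href, then the same not-this-site filter is applied. This is a sketch shown on an inline HTML snippet rather than a live fetch; to run it against a real page, replace the snippet with the decoded result of urllib.request.urlopen(siteurl).read().

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that holds a full URL."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and "://" in value:
                    self.links.append(value)

site = "stackoverflow.com"
html = """
<a href="https://stackoverflow.com/questions">Questions</a>
<a href="https://stackexchange.com/sites">Sites</a>
<a href="https://stackoverflow.blog/">Blog</a>
"""
parser = LinkCollector()
parser.feed(html)
external = [url for url in parser.links if site not in url]
print(external)
```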