I'm trying to get URLs from a web page.

I've tried wget, urllib, and lynx (which returned the most organised results). The tricky part is that the URLs appear on the page as plain text, and long ones are truncated with three dots (e.g. examppppppppppppple.com is shown as examppp...). To view the full URL you have to click on the entry's id, which opens a new window where the URL is again written out in full, also as text. I managed to get the URLs, but I didn't know how to enter an entry's page and get the URL text when it was dotted, and I'm not sure wget -r will work in my case, since the URL is only text.
This is what I wrote:
import os

def get_urls():
    # Dump the rendered page with lynx, drop the site's own links,
    # keep lines containing a URL, and strip lynx's leading numbering.
    os.system("lynx -dump https://www.example.com/"
              " | grep -v 'https://www.example.com/'"
              " | grep 'http' | cut -f5- -d' ' > urls.txt")
In this line:

grep -v 'https://www.example.com/'

I'm excluding all of the website's own links, because I only want the entries listed on the site. I've also tried using -listonly, but that would only list the page's URLs. (A pure-Python version of this pipeline is sketched after the output below.)
Output:
http://www.another-example...
https://example1.com
https://www.example.com
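For reference, the same pipeline can also be written in pure Python, with the grep/cut steps reproduced as string operations. This is only a sketch, and the site URL is a placeholder:

import os

def get_urls():
    # Same idea as the shell pipeline above, but filtered in Python.
    with os.popen("lynx -dump https://www.example.com/") as pipe, \
            open("urls.txt", "w") as out:
        for line in pipe:
            # cut -f5- -d' ' equivalent: drop lynx's leading reference number.
            parts = line.strip().split(None, 1)
            url = parts[-1] if parts else ""
            # grep / grep -v equivalents: keep only off-site URLs.
            if url.startswith("http") and not url.startswith("https://www.example.com/"):
                out.write(url + "\n")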
Updates for mid-2020
If I understand the task correctly, it is to get a list of the URLs embedded in a web page that do not have the same base URL as the page itself. So if the page is https://example.com, then list all non-'example.com/..' URLs.
Using external Lynx program
Calling Lynx, Python 3.5 and above
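A minimal sketch of this variant, assuming lynx is installed and on the PATH; subprocess.run is available from Python 3.5, and -listonly makes lynx print just the link references. The page URL and the same-site filter are placeholders:

import subprocess

def get_urls(page="https://www.example.com/"):
    # Run lynx and capture its list of link references (Python 3.5+).
    result = subprocess.run(
        ["lynx", "-dump", "-listonly", page],
        stdout=subprocess.PIPE, universal_newlines=True)
    urls = []
    for line in result.stdout.splitlines():
        # Reference lines look like "  1. https://..."; split off the number.
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].startswith("http") \
                and not parts[1].startswith(page):
            # Keep only URLs that are not on the page's own site.
            urls.append(parts[1])
    return urls

print("\n".join(get_urls()))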
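Calling Lynx, Pre-Python 3.5

Before 3.5, subprocess.check_output can serve the same purpose, under the same assumptions:

import subprocess

def get_urls(page="https://www.example.com/"):
    # check_output is available on Pythons older than 3.5.
    output = subprocess.check_output(
        ["lynx", "-dump", "-listonly", page],
        universal_newlines=True)
    urls = []
    for line in output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].startswith("http") \
                and not parts[1].startswith(page):
            urls.append(parts[1])
    return urls

The output for these examples should be like: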