Web Scraping - perform "copy all" instead of HTML Parsing


I need suggestions on how to capture the data from a webpage without normal HTML parsing. The data is rendered to the screen by behind-the-scenes scripts and server-side methods that I can't seem to unravel. I use BeautifulSoup and Selenium regularly, but this output is different.

The web page looks very simple and static. While I cannot get BeautifulSoup or Selenium to work on it, a simple "copy all" works perfectly (yes, the old manual way!).

Is there any advice on how to automate this? Basically: go to the website, press "Copy All", and return the data to Python and/or save it to a file for archiving.


I have tried many different approaches with BeautifulSoup and Selenium and only ever get part of the data. I think the rendering is done as an "anti-piracy" measure. I've fought with this website many times in the past, and the way they post the data seems to be deliberately inconsistent. Oddly enough, the website works perfectly if I do the manual copy/paste, except that it is not automated.

https://datawrapper.dwcdn.net/vEKjO/39/

There is 1 best solution below

Andrej Kesely On

The data you see on the page is loaded from a different URL in CSV form (tab-separated, despite the .csv extension). To load it, you can use this example:

from io import StringIO

import pandas as pd
import requests

url = "https://datawrapper.dwcdn.net/vEKjO/39/dataset.csv"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:123.0) Gecko/20100101 Firefox/123.0"
}

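# Fetch the raw dataset and fail early on an HTTP error status.
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()

# Despite the .csv extension, the fields are separated by tabs.
df = pd.read_csv(StringIO(response.text), sep="\t")
print(df)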

Prints:

                                   Unnamed: 0 Feb-24 Jan-24 Dec-23 Nov-23 Oct-23 Sep-23 Aug-23
0                                         NaN      %      %      %      %      %      %      %
1                     ECONOMIC PROBLEMS (NET)     30     34     32     33     38     34     31
2                          Economy in general     12     12     14     13     14     16     15
3               High cost of living/Inflation     11     13     12     10     14      9      8
4         Federal budget deficit/Federal debt      3      2      2      3      4      3      2
5                                       Taxes      2      1      1      1      *      1      *
6                           Unemployment/Jobs      2      2      2      2      2      3      2

...
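
As the printed output shows, the first row after the header only holds the "%" unit markers, the label column has no name, and some cells contain "*" instead of a number. A minimal sketch of one way to tidy this up before archiving follows. The function name clean_dataset, the column name "category", and the small sample text (a few rows and columns copied from the output above) are illustrative assumptions, not part of the original answer:

```python
from io import StringIO

import pandas as pd


def clean_dataset(raw_text: str) -> pd.DataFrame:
    """Parse tab-separated dataset text and tidy it (hypothetical helper)."""
    df = pd.read_csv(StringIO(raw_text), sep="\t")

    # The first column has no header, so pandas calls it "Unnamed: 0".
    # "category" is an assumed, more readable name.
    df = df.rename(columns={df.columns[0]: "category"})

    # Drop the units row: it has no label and "%" in every month column.
    df = df[df["category"].notna()].reset_index(drop=True)

    # Convert the month columns to numbers. Non-numeric cells such as "*"
    # become NaN, so keep the raw text as well if that marker matters.
    month_cols = df.columns[1:]
    df[month_cols] = df[month_cols].apply(pd.to_numeric, errors="coerce")
    return df


# Small sample in the same layout as the output shown above.
sample = (
    "\tFeb-24\tJan-24\tDec-23\tNov-23\tOct-23\n"
    "\t%\t%\t%\t%\t%\n"
    "Economy in general\t12\t12\t14\t13\t14\n"
    "Taxes\t2\t1\t1\t1\t*\n"
)

cleaned = clean_dataset(sample)
print(cleaned)
```

With the real download, the same function could be called on the response text, and something like cleaned.to_csv("archive.csv", index=False) would save the result for the archive (the file name is only an example).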