Web scraping a password-protected site: how to use httr2 to log in and read my own conversations


I am used to web scraping with R, but I run into difficulties with password-protected sites. My goal is to simulate a browser session and read my conversations from the page that is only available after logging in. On the following site it is easy to register and create an account. The site lets users find flats online and apply for them.

I am not looking for RSelenium solutions :)

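library(httr2); library(rvest)   # request helpers and HTML parsing
wg_site <- "https://www.wg-gesucht.de"
# wg_id / wg_pw: my account e-mail and password, defined elsewhere
credentials_id <- list(login_email_username = wg_id,
                       login_password       = wg_pw)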

I have tried two ways of logging in:

Type A - submit a form


wg_req_form <- request(wg_site) %>%                  # base URL
  req_url_path_append("/nachrichten.html") %>%       # path
  req_body_form(login_email_username = wg_id,
                login_password = wg_pw)              # submit credentials

wg_form_resp <- wg_req_form %>% 
  req_perform() %>%
  resp_body_html()
wg_form_resp %>%
  html_nodes("#main_column > div.row.my40 > div > div > div > div") %>%
  html_text() %>%
  gsub("  ","",.) %>%
  gsub("\n"," ",.)
The login did not work. I wonder whether I am missing another field that has to be sent in the POST request.
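
One way to check for a missing field is to list every input of the login form, hidden ones included, before posting. The sketch below does this with rvest on a made-up page: the markup and the `csrf_token` field are assumptions for illustration, not the real wg-gesucht.de form.

```r
library(rvest)

# Hypothetical login page. The two credential field names are the ones used
# above; the hidden "csrf_token" input is invented for this example.
login_page <- minimal_html('
  <form action="/login" method="post">
    <input type="text"     name="login_email_username">
    <input type="password" name="login_password">
    <input type="hidden"   name="csrf_token" value="abc123">
  </form>')

# Collect name, type and pre-filled value of every input inside the form.
inputs <- html_elements(login_page, "form input")
form_fields <- data.frame(
  name  = html_attr(inputs, "name"),
  type  = html_attr(inputs, "type"),
  value = html_attr(inputs, "value"),
  stringsAsFactors = FALSE
)

# Hidden inputs are the usual place for tokens the server expects back.
hidden_fields <- form_fields[form_fields$type == "hidden", ]
print(hidden_fields)
```

On the real site, `login_page` would come from `read_html()` on the login page, and any hidden field found this way would be added to the `req_body_form()` call.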

Type B - extract a session token from a script URL and send it together with the credentials as JSON

wg_session_token <- read_html("https://www.wg-gesucht.de") %>%
  html_nodes(xpath = "/html/body/script[21]") %>%            # script tag whose src carries the token
  html_attr("src") %>%
  sub("^/ajax/api/Smp/js/Session\\.min\\.tjb", "", .) %>%    # drop the path prefix
  sub("\\.js$", "", .)                                       # drop the extension

request(wg_site) %>%
  req_url_path_append("/nachrichten.html") %>%   # path
  req_body_json(credentials_id) %>%              # credentials as one JSON object
  req_perform()                                  # send and check the response
  # req_url_query(login_token = wg_session_token)

I wonder whether this token should be passed right after the JSON body or somewhere else. Besides, there are several tokens that could be relevant for accessing my account: while logging in and out and refreshing the site, I found another token-related link that could help.

https://www.wg-gesucht.de/ajax/sessions.php?action=login

I believe there are different types of token, for example one for the session and another one for the login.
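
If the login endpoint expects JSON, the request could be assembled as below. This is a sketch under assumptions: that the URL above accepts a JSON object with the two credential fields, and that a token travels as a `login_token` query parameter. Neither is confirmed for this site; the browser's network tab shows what is actually sent. The e-mail, password and token values are placeholders.

```r
library(httr2)

# Placeholder values; replace with real credentials and a real token.
credentials <- list(login_email_username = "me@example.com",
                    login_password       = "secret")
token <- "abc123"

login_req <- request("https://www.wg-gesucht.de/ajax/sessions.php") %>%
  req_url_query(action = "login") %>%
  req_url_query(login_token = token) %>%   # assumed parameter name
  req_body_json(credentials)               # adding a body makes the request a POST

# Nothing is sent until req_perform(login_req) is called.
print(login_req)
```

Printing the request object before performing it is a cheap way to confirm the URL and body type.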

So my questions are:

  1. In general, how should I proceed when scraping a password-protected page?

    I rely on this schema: open an html_session -> collect cookies and tokens -> post the credentials via a form or a JSON body -> GET (retrieve the data according to its format)

  2. How do I submit the form in this example? Is submitting a form better than finding the API endpoint that accepts JSON?

  3. How do I log in with the JSON variant? How can I authenticate via JSON, sending the credentials together with the right token? I guess the answer depends on finding the right JS file or API endpoint.
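
The schema from question 1 can be sketched in httr2 with a shared cookie jar, so that cookies set by one response are sent with the next request. This is a minimal sketch, not a confirmed recipe for this site: the login URL is the one quoted earlier, while the form encoding of the credentials and the placeholder values are assumptions.

```r
library(httr2)

base_url   <- "https://www.wg-gesucht.de"
cookie_jar <- tempfile(fileext = ".cookies")   # cookie store shared by all steps

# Every request of the session reads from and writes to the same jar.
session_request <- function(url) {
  request(url) %>% req_cookie_preserve(cookie_jar)
}

steps <- list(
  # 1. open the session: the response sets the initial cookies
  landing = session_request(base_url),
  # 2. post the credentials (placeholders; form encoding is an assumption)
  login = session_request(paste0(base_url, "/ajax/sessions.php")) %>%
    req_url_query(action = "login") %>%
    req_body_form(login_email_username = "me@example.com",
                  login_password = "secret"),
  # 3. fetch the protected page with the cookies gathered so far
  messages = session_request(paste0(base_url, "/nachrichten.html"))
)

# To run the steps in order and parse the last response:
# resps <- lapply(steps, req_perform)
# resp_body_html(resps$messages)
```

Without `req_cookie_preserve()`, each `req_perform()` starts with no cookies, so a successful login would not carry over to the request for `/nachrichten.html`.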

Thank you very much and any help is more than welcome!
