Mechanize Rails - Web Scraping - Server responds with JSON - How to Parse the URL from It to Download CSV


I am new to Mechanize and am stuck on a problem that probably has a very obvious answer.

I put together a short script to auth on an external site, then click a link that generates a CSV file dynamically.

I have finally got it to click on the export button. However, it returns an AWS URL.

I'm trying to get the script to download the CSV from the URL in this JSON response (shown below).

Myscript.rb

require 'mechanize'
require 'logger'
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'zlib'

USERNAME = "myemail"
PASSWORD = "mysecret"
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"

mechanize = Mechanize.new do |a|
  a.user_agent = USER_AGENT
end

# Sign in through the login form
form_page = mechanize.get('https://XXXX.XXXXX.com/signin')
form = form_page.form_with(:id => 'login')
form.field_with(:id => 'user_email').value = USERNAME
form.field_with(:id => 'user_password').value = PASSWORD
page = form.click_button

# Load the statistics page
donations = mechanize.get('https://XXXXX.XXXXXX.com/pages/ACCOUNT/statistics')
puts donations.body

donations = mechanize.get('https://xxx.siteimscraping.com/pages/myaccount/statistics')
# Look up the export link
bs_csv_download = page.link_with(:text => 'Download CSV')

This is the JSON response from the website. It contains the link to the CSV, which I need to parse out and download via Mechanize and/or Nokogiri.

{"message":"Find your report at https://s3.amazonaws.com/reports.XXXXXXX.com/XXXXXXX.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256\u0026X-Amz-Credential=AKIAIKW4BJKQUNOJ6D2A%2F20190228%2Fus-east-1%2Fs3%2Faws4_request\u0026X-Amz-Date=20190228T025844Z\u0026X-Amz-Expires=86400\u0026X-Amz-SignedHeaders=host\u0026X-Amz-Signature=b19b6f1d5120398c850fc03c474889570820d33f5ede5ff3446b7b8ecbaf706e"}

I very much appreciate any help.

Answered by mkrl

You could parse it as JSON and then retrieve a substring from the response (assuming it always responds in the same format):

require 'json'

...

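bs_csv_download = page.link_with(:text => 'Download CSV')
# link_with only finds the link; click it to get the JSON response
response = bs_csv_download.click
# JSON.parse needs the response body (a String), not the link object
json_response = JSON.parse(response.body)
direct_link = json_response["message"][20..-1]
mechanize.get(direct_link).save('file.csv')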

[20..-1] takes the substring of the "message" value that starts at index 20, just past the 20-character prefix "Find your report at ", and runs to the end of the string (-1 refers to the last character).
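
If the wording of the message ever changes, the fixed offset of 20 will silently cut the URL in the wrong place. A slightly more defensive variant is to search the message for the first http(s) URL instead. The following is only a sketch: extract_report_url is a hypothetical helper name (it is not part of Mechanize), the sample body uses a made-up bucket and file name, and it assumes the URL is followed by whitespace or the end of the message.

```ruby
require 'json'

# Hypothetical helper (not part of Mechanize): returns the first http(s) URL
# found in the "message" field of the JSON body.
def extract_report_url(json_body)
  message = JSON.parse(json_body).fetch('message')
  # String#[] with a regexp returns the first match, or nil if there is none.
  url = message[%r{https?://\S+}]
  raise ArgumentError, "no URL found in message: #{message.inspect}" if url.nil?
  url
end

# Sample body in the same shape as the response in the question
# (bucket and file name are placeholders).
sample_body = '{"message":"Find your report at https://s3.amazonaws.com/reports.example.com/report.csv?X-Amz-Expires=86400\u0026X-Amz-SignedHeaders=host"}'

# JSON.parse decodes the \u0026 escapes into "&", so the URL is usable as-is.
puts extract_report_url(sample_body)
```

With the objects from the snippet above, this would presumably be called as extract_report_url(bs_csv_download.click.body), and the result passed to mechanize.get(...).save('file.csv') as before.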