URL data task is not showing the right content when parsed with SwiftSoup (Swift 5)


I am pretty new to Swift and have an app that performs a simple URL data task to fetch and parse the HTML contents of a website. I was trying to load certain elements, but the parsed document did not contain the content I see when I inspect the website manually in a browser. I don't really know what the problem is.

My question is: is there a way to load the content as it appears when I visit this website in a browser?

Here is the relevant code:

import Foundation
import SwiftSoup

let config = URLSessionConfiguration.default
config.httpAdditionalHeaders = ["User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"]

let session = URLSession(configuration: config)

// `link` holds the page URL as a string; this code runs inside a method.
guard let url = URL(string: link) else { return }

let task = session.dataTask(with: url) { [self] (data, response, error) in
    // Stop on a transport error or an undecodable body instead of force-unwrapping.
    guard let data = data, error == nil,
          let htmlContent = String(data: data, encoding: .utf8) else {
        print("error")
        return
    }

    do {
        let doc: Document = try SwiftSoup.parse(htmlContent)

        let elements = try doc.getAllElements().array()

    } catch Exception.Error(type: let type, Message: let message) {
        print(type)
        print(message)
    } catch {
        print("error")
    }
}

// A data task is created in a suspended state; it does nothing until resumed.
task.resume()
            

Please let me know if there is any way to do this, even if it involves using a different package to parse the data. It is very important for my app. I would highly appreciate any help possible!

Thanks.


There are 2 best solutions below

1
aadi sach On BEST ANSWER

I found a solution that works for me. Here is the relevant code:

// Requires `import WebKit` and `import SwiftSoup` at the top of the file.
private let webView: WKWebView = {
    let prefs = WKPreferences()
    prefs.javaScriptEnabled = true
    let config = WKWebViewConfiguration()
    config.preferences = prefs
    let webView = WKWebView(frame: .zero, configuration: config)
    return webView
}()

override func viewDidLoad() {
    super.viewDidLoad()

    view.addSubview(webView)
    webView.navigationDelegate = self
    // The page load itself (webView.load(_:)) is not shown in this snippet.
}

// Called by WebKit once the page has finished loading.
func webView(_ webView: WKWebView, didFinish navigation: WKNavigation!) {
    parseData()
}

func parseData() {
    // Give the page's JavaScript time to render before reading the DOM.
    DispatchQueue.main.asyncAfter(deadline: .now() + 5.0) { [weak self] in
        self?.webView.evaluateJavaScript("document.body.innerHTML") { result, error in
            // The result is `Any?`, so cast it safely rather than with `as!`.
            guard let htmlContent = result as? String, error == nil else {
                print("error")
                return
            }

            do {
                let doc = try SwiftSoup.parse(htmlContent)
                let allProducts = try doc.getAllElements().array()
            } catch {
                print("error")
            }
        }
    }
}

Using a WebView to load the website first and then parsing the data after a delay is a working solution for me. A fixed delay might not be the best idea, so if anyone has any other suggestions, they would be highly appreciated!
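One possible alternative to the fixed delay is to poll the page until an element you care about exists, and only then read the HTML. The following is a minimal sketch, not a tested drop-in: `presenceCheckScript(for:)` and `waitForElement(matching:in:attemptsLeft:interval:completion:)` are hypothetical helper names, the attempt count and interval are arbitrary defaults, and it assumes the function is called on the main thread (as WKWebView requires).

```swift
import WebKit

// Hypothetical helper: builds a JavaScript expression that evaluates to true
// once `selector` matches at least one element in the page.
func presenceCheckScript(for selector: String) -> String {
    // Escape backslashes and single quotes so the selector stays a valid JS string literal.
    let escaped = selector
        .replacingOccurrences(of: "\\", with: "\\\\")
        .replacingOccurrences(of: "'", with: "\\'")
    return "document.querySelector('\(escaped)') !== null"
}

// Hypothetical helper: checks for the element every `interval` seconds and
// gives up after `attemptsLeft` checks. Calls `completion(true)` as soon as
// the element is found, or `completion(false)` if it never appears.
func waitForElement(matching selector: String,
                    in webView: WKWebView,
                    attemptsLeft: Int = 20,
                    interval: TimeInterval = 0.5,
                    completion: @escaping (Bool) -> Void) {
    webView.evaluateJavaScript(presenceCheckScript(for: selector)) { result, _ in
        if let found = result as? Bool, found {
            completion(true)
        } else if attemptsLeft > 1 {
            // Not there yet: schedule the next check.
            DispatchQueue.main.asyncAfter(deadline: .now() + interval) {
                waitForElement(matching: selector,
                               in: webView,
                               attemptsLeft: attemptsLeft - 1,
                               interval: interval,
                               completion: completion)
            }
        } else {
            completion(false)
        }
    }
}
```

In `webView(_:didFinish:)` you could then call `waitForElement(matching: ".product", in: webView) { found in ... }` and run the `evaluateJavaScript("document.body.innerHTML")` parsing step only when `found` is true. The `.product` selector is a placeholder; use whatever element the site actually renders late.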

6
Najinsky On

I suspect the issue may be your user agent that is being sent to the website whose response you are parsing.

The user agent is a string that is sent with the request to the URL (as an additional header). It identifies what kind of client you are so that an appropriate response can be sent.

For example, if you are requesting from Safari on Mac on Big Sur the user agent might be:

"Mozilla/5.0 (Macintosh; Intel Mac OS X 11_5_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15"

Whereas from iPad it might be:

"Mozilla/5.0 (iPad; CPU OS 14_7_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Mobile/15E148 Safari/604.1"

The site serving the request uses the user agent to determine what kind of response to return and what features to include (full site, mobile site, text site, etc).

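For a URLSession in a Swift app, the default user agent is based on the app's bundle name (followed by CFNetwork and Darwin version details) rather than a browser string. So the site may be getting confused by that and returning something different from what you see when you visit it in a browser.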

Some options:

Explore the site; it might have a better URL to use to get the info you are after.

Change the user-agent string you are sending. The basic steps are:

let config = URLSessionConfiguration.default
config.httpAdditionalHeaders = ["User-Agent": "User-Agent String Here"]
let session = URLSession(configuration: config)

You may need to adapt your use of the shared session to support this (e.g., either create a session with your config and use that, as above, or override the header on the individual request you pass to the shared session, for example with URLRequest's setValue(_:forHTTPHeaderField:)).
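If you would rather keep using the shared session, the header can be set on the request itself. This is a minimal sketch under that assumption: `makeRequest(for:userAgent:)` and `fetchHTML(from:userAgent:completion:)` are hypothetical helper names, not part of any library, and the user-agent string is whatever you choose to send.

```swift
import Foundation

// Hypothetical helper: builds a request that carries its own User-Agent
// header, so URLSession.shared can be used without a custom configuration.
func makeRequest(for url: URL, userAgent: String) -> URLRequest {
    var request = URLRequest(url: url)
    request.setValue(userAgent, forHTTPHeaderField: "User-Agent")
    return request
}

// Hypothetical helper: fetches the page and passes the decoded HTML
// (or nil on any failure) to `completion`.
func fetchHTML(from url: URL,
               userAgent: String,
               completion: @escaping (String?) -> Void) {
    let request = makeRequest(for: url, userAgent: userAgent)
    let task = URLSession.shared.dataTask(with: request) { data, _, error in
        guard let data = data, error == nil else {
            completion(nil)
            return
        }
        completion(String(data: data, encoding: .utf8))
    }
    // Data tasks start suspended, so resume to actually send the request.
    task.resume()
}
```

The string passed to `completion` can go straight into `SwiftSoup.parse`. Note that this only changes what the server is told about the client; it does not run the page's JavaScript.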