JSoup - Unable to retrieve tag of type video from Instagram

First, let me say that I am a backend developer and haven't tried to parse an HTML document in probably 15 years, so please bear with me. I also don't really know how Instagram works, which is why I'm trying to learn about it.

I am trying to download a video from Instagram, and the video is in a 'video' tag. I have tried several ways of iterating over the child elements of org.jsoup.nodes.Document, but no matter what I do, I am unable to identify the tag. I also tried Document.children().select("*"). I am wondering whether Instagram has somehow 'hidden' the video source. I really have no idea.

I also expected there to be a meta tag called og:video, but it does not exist (the title, image, etc. tags do). I tried to access it like this:

page.select("meta[property=og:video]").first().attr("content");

This is a screenshot of dev tools

The InstagramDownloader class has two recursive methods that walk all the nodes and elements; neither gives me any clue as to how to retrieve the video. I found the recursive approach in another Stack Overflow question. I don't even know whether it is possible to download the video once I have the src URL.

public class Application {

    public static void main(String[] args) {

        try {
            login();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void login() throws IGLoginException, InterruptedException, ExecutionException{

        IGClient client = IGClient.builder().username("myuser").password("mylogin").login();
        
        InstagramDownloader dl = new InstagramDownloader();
        dl.downloadVideo("https://www.instagram.com/reel/CzeWZCYJ09R/", "C:\\temp");
    }
}

public class InstagramDownloader {

    private Document page;
    private final String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36";

    public void downloadVideo(String url, String targetDirectory){
        String videoUrl = "";

        Helpers.validateURL(url);
        try {
            page = Jsoup.connect(url).userAgent(USER_AGENT).get();
            getAllElements(page);
            getAllNodes(page);
            //videoUrl = ???
            
        } catch (IOException e){
            e.printStackTrace();
        }
        download(videoUrl, targetDirectory);
    }

    public void getAllElements(Document doc) {
        Elements children = new Elements();
        recurseOverElements(doc.getAllElements(), children);

        for (Element element : children) {
            System.out.println(element.tagName());
        }
    }

    public Elements recurseOverElements(Elements elementList, Elements children){
        if(elementList.size() == 0)
            return children;

        for (Element element : elementList) {

            recurseOverElements(element.children(), children);
            children.add(element);
        }
        return children;
    }
    
    public void getAllNodes(Document doc) {
        List<Node> allNodesInDom = new ArrayList<>();
        recurseOverNodes(doc.childNodes(), allNodesInDom);

        for (Node node : allNodesInDom) {
            System.out.println(node.nodeName());
        }
    }
    
    public List<Node> recurseOverNodes(List<Node> nodeList, List<Node> allChildNodeList){
        if(nodeList.size() == 0)
            return allChildNodeList;

        for (Node node : nodeList) {
            recurseOverNodes(node.childNodes(), allChildNodeList);
            allChildNodeList.add(node);
        }
        return allChildNodeList;
    }

    private void download(String url, String targetDirectory){
        String[] tempName = url.split("/");
        String filename = tempName[tempName.length-1].split("[?]")[0];

        try {
            // Open a single connection and reuse it for both the size and the body.
            HttpURLConnection conn = (HttpURLConnection)URI.create(url).toURL().openConnection();
            Path targetPath = new File(targetDirectory + File.separator + filename).toPath();
            try (InputStream inputStream = conn.getInputStream()) {
                // Do not read from the stream before copying; that would drop the first byte of the file.
                Files.copy(inputStream, targetPath, StandardCopyOption.REPLACE_EXISTING);
            }

            int BYTES_PER_KB = 1024;
            // getContentLength() returns -1 when the server does not report a length.
            double fileSize = ((double)conn.getContentLength() / BYTES_PER_KB);
            System.out.println("Downloaded " + filename + " (" + fileSize + " KB)");
        } catch (IOException e){
            e.printStackTrace();
        }
    }
}

rzwitserloot On BEST ANSWER

I have bad news for you: you can take your code and toss it in the garbage. Your plan is fundamentally never going to work here.

The problem you're running into, and it is one you tend to run into these days for almost anything, is that JSoup cannot actually handle the modern web: it parses HTML, and most sites no longer ship their content as HTML.

The problem is simply this: the HTML your browser downloads (i.e. the stuff you feed to JSoup) has pretty much zero content in it. Instead, the HTML causes a bunch of JavaScript to run, and that JavaScript makes all sorts of network requests and creates more HTML with the actual content.

JSoup is just an HTML parser. It's not a JavaScript engine. Hence, all that HTML that the JavaScript generates with the content in it? It doesn't exist in what JSoup sees, and JSoup trivially cannot give you things that just aren't there. Running the JavaScript is really complicated and requires, more or less, an entire browser: a very heavy job. If you want to look into that, you probably want to try Selenium.

When you right-click in your browser and pick 'Inspect element', you're looking at the live DOM. It starts out identical to the HTML page downloaded from the server, but the JavaScript the page runs can modify it, and on most modern sites it has been modified so much that it's the difference between a crude tent and a cathedral.

Instead, pick 'View source', or just use curl to fetch the actual URL, and check whether the info you want is in there. Odds are extremely high it is not.

If it is not, JSoup is not going to help you.
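
The check described above can also be done from Java without JSoup. What follows is only a minimal sketch, assuming Java 11 or later for java.net.http; the class name RawHtmlCheck, the helpers hasVideoTag, hasOgVideo and fetch, and the User-Agent value are all made up for illustration. It fetches the page exactly as the server sends it (no JavaScript runs) and reports whether a video tag or an og:video property appears anywhere in that raw text.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Locale;
import java.util.regex.Pattern;

public class RawHtmlCheck {

    // Matches an opening <video> tag, case-insensitively (hypothetical helper).
    private static final Pattern VIDEO_TAG =
            Pattern.compile("<video[\\s>/]", Pattern.CASE_INSENSITIVE);

    // True if the raw, unrendered HTML contains an opening <video> tag.
    static boolean hasVideoTag(String html) {
        return VIDEO_TAG.matcher(html).find();
    }

    // True if the raw HTML mentions og:video anywhere (a plain substring check).
    static boolean hasOgVideo(String html) {
        return html.toLowerCase(Locale.ROOT).contains("og:video");
    }

    // Fetches the response body as text; this is what an HTML parser would be given.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "Mozilla/5.0") // arbitrary value, an assumption
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // Defaults to the URL from the question; pass another URL as the first argument.
        String url = args.length > 0 ? args[0] : "https://www.instagram.com/reel/CzeWZCYJ09R/";
        String html = fetch(url);
        System.out.println("<video> tag present: " + hasVideoTag(html));
        System.out.println("og:video present:    " + hasOgVideo(html));
    }
}
```

If both checks come back false for the raw response, the tag you see in dev tools was created later by JavaScript, and no amount of walking the JSoup document will find it.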

Generally, websites have APIs for this sort of thing. That's the right tool for the job here, not "I will just act like a browser and parse that video URL out". Note that scraping like this gets around moderation and especially advertising and user management, so the builders of these sites are actively trying to fight you. That doesn't make your job impossible, just [A] crazy difficult, [B] illegal in certain jurisdictions (shitty jurisdictions, but the USA might well be one of them; the DMCA is not a well-written law), and [C] a perennial, significant maintenance headache. Instagram does not sit still: what works today is likely not to work tomorrow, especially if they notice in the logs what you're doing and try to stop you.

This is why APIs exist. The server builders streamline what they do and don't support, make it simpler for you, add whatever authentication they need to cover their legal and marketing needs (API keys and such), and can then offer and support a stable interface, instead of breaking half the web when they decide to lightly restyle their front page.