Jsoup parsing weird behavior


I'm trying to parse pages with jsoup in a loop, but after several iterations the loop starts all over again, as if it were running in a parallel thread. Why does this happen?

Code of my method:

public void parser(String type, String someUrl) throws InterruptedException {
        List<Item> list = itemRepository.findAllByType(type);
        int count = 0;
        Document page;
        String url;
        Elements el;
        double price;
        double buy;

        for (Item item : list) {
            count++;
            System.out.println("iteration    " + count);
            url = someUrl + item.getBuffId();

            try {
                page = Jsoup.parse(new URL(url), 20000);
                Thread.sleep(2000);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            el = page.getElementsByAttribute("data-goods-sell-min-price");
            price = Double.parseDouble(el.attr("data-goods-sell-min-price")) / 100;
            buy = Double.parseDouble(el.attr("data-goods-buy-max-price")) / 100;
            item.setPrice(price);
            item.setBuyOrder(buy);
            try {
                item.setPercentage(round((100 * ((price * 0.975) - buy) / buy), 4));
            } catch (NumberFormatException e) {
                System.out.println(item.getName());
                System.out.println(price);
                System.out.println(buy);
            }
            item.setProfit(round((price * 0.975 - buy), 2));
        }
        itemRepository.saveAll(list);
    }

Console output:

iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
iteration 6
iteration 1
iteration 7
iteration 2
iteration 8
iteration 3

In very rare cases everything works fine. Usually the second thread starts after 7-10 iterations, sometimes after 30+ iterations.

P.S. Maybe this is important: I'm using a Spring Boot application, and this method is called in a GET request.

P.P.S. I also tried making the method synchronized, but then, after the last loop iteration, it starts again from the beginning.


There are 2 best solutions below

1
4EACH On

IMHO, in your use case you create a new request behind the scenes on every iteration, and sometimes that opens a new thread.

You can use Jsoup.connect(), which returns a new Connection object, and then make sure the connection is finished by the end of each iteration.

https://jsoup.org/apidocs/org/jsoup/Jsoup.html#connect(java.lang.String)


UPDATE:

Try working on new objects rather than on the same objects you are iterating over.

public List<Item> parseAndSaveItems(String type, String baseUrl) {
    List<Item> originalItemList = itemRepository.findAllByType(type);
    List<Item> newItemList = new ArrayList<>();

    for (Item originalItem : originalItemList) {
        Item newItem = processItem(originalItem, baseUrl);
        newItemList.add(newItem);
    }

    itemRepository.saveAll(newItemList);
    return newItemList;
}

private Item processItem(Item originalItem, String baseUrl) {
    Item newItem = new Item();

    // Copy relevant properties from the original item to the new item
    newItem.setType(originalItem.getType());
    newItem.setBuffId(originalItem.getBuffId());
    // Copy other properties as needed

    String url = baseUrl + originalItem.getBuffId();
    try {
        Document page = getPage(url);
        Thread.sleep(2000);

        Elements elements = page.getElementsByAttribute("data-goods-sell-min-price");
        double price = parsePrice(elements.attr("data-goods-sell-min-price"));
        double buy = parsePrice(elements.attr("data-goods-buy-max-price"));

        updateItemDetails(newItem, price, buy);
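    } catch (IOException e) {
        // Handle exceptions appropriately: log and continue, or rethrow, based on requirements.
        log.error("Error processing item: {}", originalItem.getName(), e);
    } catch (InterruptedException e) {
        // Restore the interrupt flag so that callers can see the thread was interrupted.
        Thread.currentThread().interrupt();
        log.error("Interrupted while processing item: {}", originalItem.getName(), e);
    }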

    return newItem;
}

private Document getPage(String url) throws IOException {
    return Jsoup.parse(new URL(url), 20000);
}

private double parsePrice(String priceString) {
    return Double.parseDouble(priceString) / 100;
}

private void updateItemDetails(Item item, double price, double buy) {
    item.setPrice(price);
    item.setBuyOrder(buy);

    try {
        item.setPercentage(round((100 * ((price * 0.975) - buy) / buy), 4));
    } catch (NumberFormatException e) {
        log.error("Error calculating percentage for item: {}", item.getName(), e);
    }

    item.setProfit(round((price * 0.975 - buy), 2));
}

1
Thomas Kläger On

You give the answer yourself:

this method is called in a GET request

Spring (or the webserver it contains) starts multiple threads so that it can handle multiple GET requests in parallel.

As soon as a second (*) client sends a second GET request, your method will be executed in parallel.
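The effect can be reproduced without Spring or jsoup. The following is only a minimal sketch: `RequestThreadDemo`, `handleRequest` and `simulate` are hypothetical names, the thread pool stands in for the web server's request threads, and `Thread.sleep(20)` stands in for the slow page download. Each simulated request keeps its own `count`, so the log contains several independent "iteration 1, 2, 3..." sequences mixed together, much like the console output in the question.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class RequestThreadDemo {

    // Stand-in for the asker's parser(): counts iterations and records which thread ran each one.
    static void handleRequest(int iterations, List<String> log) throws InterruptedException {
        int count = 0;
        for (int i = 0; i < iterations; i++) {
            count++;
            log.add(Thread.currentThread().getName() + " iteration " + count);
            Thread.sleep(20); // simulates the slow page download
        }
    }

    // Simulates `requests` GET requests arriving at the same time, one pool thread per request.
    static List<String> simulate(int requests, int iterations) {
        List<String> log = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(requests);
        for (int r = 0; r < requests; r++) {
            pool.submit(() -> {
                try {
                    handleRequest(iterations, log);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        // Each line starts with the thread name, which shows that two loops are running.
        simulate(2, 5).forEach(System.out::println);
    }
}
```

Printing `Thread.currentThread().getName()` in the original loop is a quick way to confirm that the repeated "iteration 1" lines come from a second request thread rather than from the loop restarting.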

If you mark your method as synchronized, only one thread at a time will be able to execute it, at the expense of longer response times for other clients.
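A small sketch of what synchronized does here, again with hypothetical names (`SynchronizedParserDemo`, `run`) and a sleep in place of the real loop: the calls never overlap, but every waiting call still runs once the previous one has finished. That would also explain the P.P.S. in the question, where the loop starts again from the beginning after the last iteration.

```java
import java.util.concurrent.atomic.AtomicInteger;

class SynchronizedParserDemo {
    private final AtomicInteger running = new AtomicInteger();
    private final AtomicInteger maxRunning = new AtomicInteger();
    private final AtomicInteger completedRuns = new AtomicInteger();

    // Stand-in for the asker's method; synchronized lets only one thread in at a time.
    synchronized void parser() {
        int now = running.incrementAndGet();
        maxRunning.accumulateAndGet(now, Math::max);
        try {
            Thread.sleep(50); // simulates the slow loop
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            running.decrementAndGet();
            completedRuns.incrementAndGet();
        }
    }

    // Calls parser() from `callers` threads at once; returns {max concurrent runs, completed runs}.
    static int[] run(int callers) {
        SynchronizedParserDemo demo = new SynchronizedParserDemo();
        Thread[] threads = new Thread[callers];
        for (int i = 0; i < callers; i++) {
            threads[i] = new Thread(demo::parser);
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return new int[] { demo.maxRunning.get(), demo.completedRuns.get() };
    }

    public static void main(String[] args) {
        int[] result = run(3);
        System.out.println("max concurrent = " + result[0] + ", completed = " + result[1]);
    }
}
```

With three callers the sketch should report at most one concurrent run and three completed runs: the work is serialized, not deduplicated.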


*) If your method takes too long to execute, the first client's request may even time out and that client may resend the same request, which would still end up as a second thread running your method.
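If a resent request should not trigger a second run at all, one possible approach (not something the question or the answers above use, so treat it as an assumption) is to skip the call while a run is already in progress. The sketch below uses a hypothetical `SingleRunGuard` class built on `AtomicBoolean`; in the Spring controller the skipped case could be answered with an error status instead of starting the loop again.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class SingleRunGuard {
    private final AtomicBoolean running = new AtomicBoolean(false);

    // Runs the task only if no other run is in progress; returns false when the call was skipped.
    boolean tryRun(Runnable task) {
        if (!running.compareAndSet(false, true)) {
            return false; // another thread is already inside
        }
        try {
            task.run();
            return true;
        } finally {
            running.set(false); // always release the guard, even if the task throws
        }
    }

    public static void main(String[] args) {
        SingleRunGuard guard = new SingleRunGuard();
        boolean ran = guard.tryRun(() -> System.out.println("parsing..."));
        System.out.println("ran = " + ran);
    }
}
```

Unlike synchronized, a second caller does not wait here; it returns immediately, so the response time of other clients stays short and the loop is not repeated.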