I'm getting a very strange error here that I can't quite understand. As it stands I'm making a crawler program that uses ForkJoinTasks to scan for links on the page and then crawl those as well. However, when I use .join() to collect the tasks, I sometimes get ConcurrentModificationExceptions pointing to that line.
Spider[] newSpiders = crawlForURLs("a", "href").stream()
.filter(url -> SpiderUtils.isLinkURLValid(url))
.map(url -> new Spider(url).fork()).toArray(Spider[]::new);
for (Spider c : newSpiders) {
Image[] crawledImages = c.join();
}
The line in question is c.join(). I tried surrounding it in a ReentrantLock just to see what would happen but it made no difference.
Full stack trace: https://pastebin.com/qCJte65Q
Lines of code are:
at com.eulerity.hackathon.imagefinder.Spider.compute (Spider.java:85) (Image[] crawledImages = c.join();)
at com.eulerity.hackathon.imagefinder.ImageFinder.doPost (ImageFinder.java:45) images = SpiderUtils.commonPool.invoke(baseSpider);
I honestly can't figure out why this is the case, but if you need to see another part of my code please don't hesitate to ask. Thank you!
The problem, I believe, is in your
SpiderUtilsclass. More specifically, this method:And this method:
Both these methods are not
synchronized, which means multiple threads can execute them at the same time. This is likely not a problem for thecontainsURL(...)method. Given what it does, it has no reason to besynchronized(neither it, nor theurlsMatch(...)method, rely on any external state). The problem is with how theisLinkURLValid(...)method passes bothlinksCrawledandlinksInProgressto thecontainsURL(...)method.These list variables are static, and thus the same instances are passed to the
isLinkURLValid(...)method on different threads. This would be okay, if all you did was stream the elements of these lists (i.e., only read from the lists). However, elsewhere in your code you also add elements to these lists (i.e., you write to the lists). These writes are performed insynchronizedmethods, but that doesn't matter because you're reading the lists without synchronizing on the same object. So, your code is vulnerable toConcurrentModificationExceptions (you cannot modify a non-concurrent collection while iterating over it).I believe the solution would be to simply make the
isLinkURLValid(...)methodsynchronized, but I'm not 100% positive as I haven't tested it.