There is a site/resource that offers some general statistical information as well as a search interface. These search operations are costly, so I want to restrict frequent and continuous (i.e. automated) search requests (from people, not from search engines).
I believe there are many existing techniques and frameworks that protect against automated scraping, so I don't have to reinvent the wheel. I'm using Python and Apache through mod_wsgi.
I am aware of mod_evasive (will try to use it), but I'm also interested in any other techniques.
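For context, one way to throttle per client without mod_evasive is a small piece of WSGI middleware. The sketch below is illustrative only: the /search prefix, the SearchThrottle name, and the limits are assumptions, and the in-memory state assumes a single mod_wsgi daemon process.

    import time
    from collections import defaultdict, deque

    class SearchThrottle:
        """WSGI middleware that caps requests to a costly path per client IP.

        State is in-memory, so limits only hold within one mod_wsgi daemon
        process; a shared store (memcached, a database table) would be
        needed to enforce them across processes.
        """

        def __init__(self, app, prefix="/search", max_hits=5, window=60):
            self.app = app
            self.prefix = prefix              # path to protect (hypothetical)
            self.max_hits = max_hits          # allowed requests per window
            self.window = window              # window length in seconds
            self.recent = defaultdict(deque)  # ip -> timestamps of recent hits

        def __call__(self, environ, start_response):
            if environ.get("PATH_INFO", "").startswith(self.prefix):
                ip = environ.get("REMOTE_ADDR", "?")
                now = time.time()
                hits = self.recent[ip]
                # Drop timestamps that have fallen out of the sliding window.
                while hits and now - hits[0] > self.window:
                    hits.popleft()
                if len(hits) >= self.max_hits:
                    start_response("429 Too Many Requests",
                                   [("Content-Type", "text/plain"),
                                    ("Retry-After", str(self.window))])
                    return [b"Too many searches; try again later.\n"]
                hits.append(now)
            return self.app(environ, start_response)

    # application = SearchThrottle(application)  # wrap the existing WSGI app

Note that REMOTE_ADDR will be the proxy's address if Apache sits behind one, in which case a forwarded-for header would have to be consulted instead.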
You could try a robots.txt file. I believe you just put it at the root of your application; the robots.txt documentation has more details. The Disallow directive is what you're looking for. Of course, not all robots respect it, but they all should, and all the big crawlers (Google, Yahoo, etc.) will.
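For example, assuming the costly search interface lives under a /search path (the path is an assumption), the file could look like this:

    User-agent: *
    Disallow: /search

Disallow matches by path prefix, so for compliant crawlers this also covers URLs like /search?q=... under that path.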
You may also be interested in this question about disallowing dynamic URLs.