80Legs

Introduction

80Legs is probably the most annoying bot or web crawler currently in existence. The Wikipedia entry on 80legs describes it as "a web crawling service that allows its users to create and run web crawls through its software as a service platform." [1]

The crawler exists purely for 80legs' commercial gain and will never drive traffic to any site. Its crawls are commissioned by customers who pay big bucks for the information retrieved, with little or no benefit to the site owner.

In reality it behaves more like a Distributed Denial of Service attack: it ignores robots.txt and can bring a web server to its knees. Because of its distributed nature, the crawlers can come from anywhere on the globe, and they eat bandwidth like there is no tomorrow. I've decided to take the drastic action of using iptables to block the IP addresses of these crawlers.

The Crawler

The crawler usually identifies with the following string:

"Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620"

To get an idea of how prolific this crawler is, parse your access logs with a command similar to this:

grep -c 80legs /path/to/log/access.log

For me this returned 48866 matching requests in a 7-day period!
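Since the crawler is distributed, it also helps to see which addresses the requests come from. The following is a minimal sketch, not a command from the original write-up: the function name top_80legs_ips is made up for illustration, and it assumes the client IP is the first field of each log line, as in Apache's common and combined log formats.

```shell
# Hypothetical helper: list the client IPs whose requests mention 80legs,
# busiest first. Assumes the client IP is the first whitespace-separated
# field of each log line (Apache common/combined log format).
top_80legs_ips() {
    grep 80legs "$1" | awk '{print $1}' | sort | uniq -c | sort -rn
}

# Usage (path is a placeholder):
#   top_80legs_ips /path/to/log/access.log | head -n 20
```

Each output line is a request count followed by an IP address, so the worst offenders appear at the top.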

Blocking the bot

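You can modify your .htaccess file to block the user agent, but this can be fairly resource intensive, because the web server still has to accept and inspect every request before rejecting it. If you are running a reverse proxy such as Squid, you can also use it to block the user agent.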

However, given the proliferation of this bot, and because it can come from anywhere in the world, I've decided simply to block the IP addresses it crawls from.
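The exact rules I use are not reproduced here, so the following is only a sketch of how such a block could look. The function name block_ips and the APPLY switch are made up for illustration. It assumes IPv4 addresses, one per line on standard input, and appends a DROP rule to the INPUT chain for each. By default it only prints the commands so you can review them first.

```shell
# Hypothetical helper: read one IPv4 address per line on stdin and drop
# all traffic from it. Prints the iptables commands by default; set
# APPLY=1 to actually run them (requires root).
block_ips() {
    while read -r ip; do
        # Skip anything that does not look like a dotted-quad IPv4 address.
        printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || continue
        if [ "${APPLY:-0}" = "1" ]; then
            iptables -A INPUT -s "$ip" -j DROP
        else
            echo "iptables -A INPUT -s $ip -j DROP"
        fi
    done
}

# Usage (path is a placeholder; assumes the client IP is the first log field):
#   grep 80legs /path/to/log/access.log | awk '{print $1}' | sort -u | block_ips
```

Rules added this way do not survive a reboot unless you save them with whatever mechanism your distribution provides, and running the helper twice appends duplicate rules.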