80Legs

From RSWiki
Latest revision as of 13:52, 4 September 2017

Introduction

80Legs is probably the most annoying bot or web crawler currently in existence. It's described as "a web crawling service that allows its users to create and run web crawls through its software as a service platform." [1] (Wikipedia entry on 80legs)

The crawler exists purely for 80legs' commercial gain and will never drive traffic to any site. Instead, 80legs is commissioned by clients who pay big bucks for the information it retrieves, with little or no benefit to the site owner.

In reality it behaves more like a Distributed Denial of Service attack: it ignores robots.txt and can bring a web server to its knees. Because of its distributed nature the crawlers can come from anywhere on the globe, and they eat bandwidth like there is no tomorrow. I've decided to take the drastic action of using iptables to block the IP addresses of these crawlers.

The Crawler

The crawler usually identifies itself with the following user agent string:

"Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620"

To get an idea of how prolific this crawler is, parse your access logs with a command similar to this:

grep -c 80legs /path/to/log/access.log

For me this returned 48,866 matches over a 7-day period!
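Beyond a raw count, it helps to see which client addresses are doing the crawling. A minimal sketch, assuming the standard Apache combined log format (client IP in the first field); the helper function name is mine, not a standard tool:

```shell
# Tally the client IPs whose requests mention 80legs, busiest first.
# Assumes the combined log format, where the client IP is field 1.
top_80legs_ips() {
    grep 80legs "$1" | awk '{print $1}' | sort | uniq -c | sort -rn
}
```

Run it as, for example, top_80legs_ips /path/to/log/access.log to get a count-per-IP listing you can feed into a blocklist.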

Blocking the bot

You can modify your .htaccess file to block the user agent; however, since that check runs on every request, it can be fairly intensive. Alternatively, if you are running a Reverse Proxy like Squid, you can use it to block the user agent.
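For reference, an .htaccess rule along those lines might look like the following. This is a hedged sketch, not the article's own config: it assumes Apache 2.4 with mod_setenvif available, and the patterns and environment variable name are illustrative, matched against the 008/80legs signature shown above.

```apache
# Flag requests whose User-Agent mentions 80legs (or its 008 crawler ID),
# then refuse them. Pattern and env var name are illustrative.
BrowserMatchNoCase "80legs" bad_bot
BrowserMatchNoCase "008/0\.83" bad_bot
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>
```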

However, given the proliferation of this bot, and because it can come from anywhere in the world, I've decided simply to block the offending IP addresses.
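The IP-blocking step can be sketched as follows. This is a minimal illustration, not 80legs' actual address list: the addresses shown are RFC 5737 documentation placeholders, and make_drop_rules is a helper name of my own. Printing the commands, rather than running them, lets you review the list before applying it as root.

```shell
# Turn a newline-separated list of offending IPs into iptables DROP rules.
make_drop_rules() {
    while read -r ip; do
        echo "iptables -A INPUT -s $ip -j DROP"
    done
}

# Placeholder addresses from the RFC 5737 documentation ranges:
printf '192.0.2.10\n198.51.100.7\n' | make_drop_rules
```

Once you're happy with the generated rules, pipe them to a root shell or convert them into an iptables-restore file so the blocks survive a reboot.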