Blocking Bots with Squid
This HowTo assumes that, like me, you are using Squid as a reverse proxy. If you are not, you may find this document useful nonetheless.
If you are using multiple backend servers with Squid as a frontend or load balancer, you may be interested in reducing the bandwidth wasted by nefarious crawlers, bots and spambots. With a single server facing the internet you may be familiar with using .htaccess and mod_rewrite to provide similar functionality; however, if some of your backend servers are not running Apache or another server that supports mod_rewrite, you can use Squid in a similar fashion to block certain traffic before it reaches your backend servers.
Since I implemented this solution I have seen the number of spam comments on my blog drop from around 100 per day to between two and ten!
I will assume that you have Squid configured and serving as a reverse proxy. We will be using regular expressions (regexes) stored in a text file, so first, from a terminal or at the console, you need to create this file. For simplicity's sake I suggest storing it along with the rest of your Squid configuration files. In most scenarios this will be /etc/squid, so navigate to your configuration directory and create your file as follows:
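A minimal way to do this, assuming the standard /etc/squid location mentioned above (adjust the path if your configuration lives elsewhere):

```
cd /etc/squid
touch badbrowsers.conf
```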
Next we need to configure Squid so that, on startup or configuration reload, it blocks content based on the contents of this file. For this you need to define an acl in squid.conf, so open squid.conf and locate the section with the refresh_pattern definitions. After these, add a line with the following:
acl badbrowsers browser "/etc/squid/badbrowsers.conf"
Then you need to add a proxy restriction, so navigate through squid.conf to where your http_access acls are and, above the definitions for your backend servers, add the following:
http_access deny badbrowsers
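Taken together, the relevant part of squid.conf ends up looking something like this sketch (the allow line is a placeholder standing in for whatever backend rules you already have; ordering matters, since Squid evaluates http_access rules top to bottom):

```
# match the User-Agent header against the patterns in this file
acl badbrowsers browser "/etc/squid/badbrowsers.conf"

# deny bad browsers before any of the allow rules for the backends
http_access deny badbrowsers
http_access allow our_sites        # placeholder for your existing backend acl
```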
Save squid.conf and that is the first few steps completed. On startup, Squid will now read the contents of badbrowsers.conf.
To get started we will block visitors based on the User-Agent string they identify themselves with. For our first example we will block the libwww crawler. Open badbrowsers.conf and simply add the following line:
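One anchored regex on its own line is enough. Note that the patterns are case-sensitive, which is why the full list at the end of this article contains both ^libwww and ^Libwww:

```
^libwww
```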
After saving your file you need to reload Squid's configuration. On my server this is done by issuing the following command from a terminal:
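The exact command varies by distribution, but squid -k reconfigure is Squid's own reload switch and works on most installations:

```
squid -k reconfigure
```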
From now on, any client identifying itself as libwww will get a 403 error in response. If you have a list of user agents you would like to block, add each one on a new line and reload Squid's configuration for the changes to take effect.
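For example, a badbrowsers.conf blocking a handful of agents (all taken from the full list at the end of this article) would look like this, one pattern per line:

```
^libwww
^Java
^Python
^MJ12bot
```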
We can now fine-tune our blocking criteria in more detail. I like to deny access on relatively strict criteria, the most effective of which has been to deny access to old browsers that no one in their right mind should be using to browse the web. To achieve this we need to get a bit more technical with our regular expressions, but nothing too severe.
For the first example we will block clients claiming to be using Microsoft Internet Explorer 3, which hasn't been in use for a decade!
First up we need to understand the user-agent string itself. All browsers identify themselves and their version number, although this won't be evident if you scan through Squid's native log format. A recent user agent claiming to be Microsoft Internet Explorer 3 that attempted to spam my blog identified itself as follows:
Mozilla/2.0 (compatible; MSIE 3.01; Windows 98)
All browsers tend to identify themselves with a Mozilla prefix, even every version of Internet Explorer, so using a method similar to the one we used to block libwww is not an option: it would deny access to almost everyone. But you can see from the string that we have a few ways to block this bot and others like it, the first being the Mozilla version.
We are going to block this user agent using Mozilla/2.0 as our regular expression. So edit your badbrowsers.conf once more and on a new line add the following:
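On its own line in badbrowsers.conf, the pattern is most precisely written with the dot escaped so it matches a literal full stop (the unescaped form would also work, since . matches any character):

```
Mozilla/2\.0
```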
Save and reload Squid's configuration, and the next time this bot visits it will be greeted with a 403 denied response.
Another regular expression that achieves the same result uses the MSIE 3.01 identifier. To do this, add the following regular expression to your badbrowsers.conf:
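Again one regex per line; because this pattern is unanchored, it matches the identifier anywhere in the user-agent string:

```
MSIE 3\.01
```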
These two examples give us the basis for blocking traffic from browsers that no one in their right mind should be using. A slight variation is needed for blocking older versions of Firefox, however, because all versions of Firefox identify themselves as Firefox/x, where x is the version number. So if you would like to block visitors claiming to be using Firefox 0.9.6, for example, you can use the following regular expression:
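With the dots escaped as before, the exact-version pattern is shown below; a broader pattern such as Firefox/0 (which appears in the list at the end of this article) would catch every 0.x release in one line:

```
Firefox/0\.9\.6
```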
What should you block? Well, that depends on which bots and user agents you would like to deny. There are many sites out on the net with comprehensive lists of all the bad bots and browsers, but in reality you are unlikely to see more than a small fraction of them. However, there is nothing stopping you from adding them to your badbrowsers.conf, and you can use the latest version of mine as a starting point. My example below has been in use for almost a year at the time of writing and I have only needed to edit it a handful of times since.
My List of Bots to block
Here are the contents of my badbrowsers.conf:
Netscape
(.*?)MSIE 5
(.*?)MSIE 4
(.*?)MSIE 3
(.*?)Firefox/0
(.*?)Firefox/1.0
^Speedy
^Nutch
^Java
^shelob
^ISC
^TrackBack
^ichiro
^bot
^HouxouCrawler
^User
^user
^Attentio
^Mp3Bot
^NameOfAgent
^yacybot
^WWW-Mechanize
^MLBot
^WebAlta
^MJ12bot
^Python
^libwww
^Libwww
^trackback
^heretrix
^jakarta
^Jakarta
^LibWWW
^Mozilla/1\.
^Larbin
^larbin
^Gator
^gator
^Mozilla/3\.
^Mozilla/4\.
^Opera/7\.
^Opera/6\.
^Opera/5\.
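Before reloading Squid after editing the list, you can sanity-check your patterns offline with grep -E, which uses the same POSIX extended regex syntax Squid applies to browser acls. This sketch writes a few patterns from the list above to a temporary file and tests them against two illustrative user-agent strings (the file path and sample agents are my own choices for the demonstration):

```shell
# Write a small subset of the blocklist to a scratch file
cat > /tmp/badbrowsers.conf <<'EOF'
^libwww
^Mozilla/1\.
Firefox/0
EOF

# A blocked agent matches one pattern (prints 1)...
printf '%s\n' 'libwww-perl/5.805' | grep -E -c -f /tmp/badbrowsers.conf

# ...while a modern browser matches none (prints 0).
printf '%s\n' 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0' | grep -E -c -f /tmp/badbrowsers.conf
```

If a legitimate browser string unexpectedly prints 1, you know which edit to revisit before it starts handing real visitors 403 responses.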