Blocking Bots with Squid

Introduction

This HowTo assumes that, like me, you are using Squid as a reverse proxy. If you are not, you may find this document useful nonetheless.

Background

If you are using multiple backend servers with Squid as a frontend or load balancer, you may be interested in reducing the bandwidth wasted by nefarious crawlers, bots and spambots. With a single server facing the internet you may be used to doing this with .htaccess and mod_rewrite, but if some of your backend servers are not running Apache, or are running a server that does not support mod_rewrite, you can use Squid in a similar fashion to block unwanted traffic before it reaches your backend servers.

Since I implemented this solution the amount of comment spam on my blog has dropped from around 100 per day to between two and ten!

Getting Started

I will assume that you have Squid configured and serving as a reverse proxy. We will be using regular expressions (regex) stored in a text file, so the first step is to create this file from a terminal or at the console. For simplicity's sake I suggest storing it along with the rest of your Squid configuration files. In most setups this will be /etc/squid, so navigate to your configuration directory and create the file as follows:

touch badbrowsers.conf

Next we need to configure Squid so that, on startup or configuration reload, it reads this file and blocks requests based on its contents. For this you need to edit squid.conf and define an acl, so open squid.conf and locate the section with the refresh_pattern definitions. After these, add a line with the following:

acl badbrowsers browser "/etc/squid/badbrowsers.conf"

Then you need to add a proxy restriction, so navigate through squid.conf to where your http_access rules are and, above the definitions for your backend servers, add the following:

http_access deny badbrowsers

Save squid.conf and that is the first few steps completed. On startup Squid will now consult the contents of badbrowsers.conf.
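
Pulled together, the relevant parts of squid.conf end up looking roughly like the sketch below. The mysites acl is only a placeholder for whatever acl your existing reverse proxy configuration uses to allow traffic through to the backends; the two badbrowsers lines are the only additions.

# your existing refresh_pattern definitions stay as they are
refresh_pattern .	0	20%	4320

# load the useragent regular expressions from the external file
acl badbrowsers browser "/etc/squid/badbrowsers.conf"

# deny bad useragents before anything is allowed through to the backends
http_access deny badbrowsers
http_access allow mysites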

Blocking Bots

To get started we will block visitors based on the useragent string they identify themselves with. For our first example we will block the libwww crawler. Open badbrowsers.conf and simply add the following line:

^libwww

After saving your file you need to reload Squid's configuration. On my server this is done by issuing the following command from a terminal:

/etc/init.d/squid reload

From now on any client identifying itself as libwww will get a 403 error in response. You may have a list of useragents that you would like to block, so add each one on a new line and reload Squid's configuration for the changes to take effect.
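
If you want to confirm the rule is working, you can fake the useragent with curl and check the response. The hostname and useragent string below are only examples; substitute your own site and whichever pattern you added.

curl -I -A "libwww-perl/5.805" http://www.example.com/

A blocked client should get an HTTP 403 Forbidden status line back, while the same request with a normal browser useragent should still be served.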

Fine Tuning

We can now fine tune our blocking criteria in more detail. I like to deny access based on relatively strict criteria, the most effective of which has been to block old browsers that no one in their right mind should be using to browse the web. To achieve this we need to get a bit more technical in our regular expressions, but nothing too severe.

For the first example we will block clients claiming to be using Microsoft Internet Explorer 3, which hasn't been in use for a decade!

First up we need to understand the browser string itself. All browsers identify themselves and their version number, although this won't be evident if you scan through Squid's native log format. However, a recent useragent claiming to be Microsoft Internet Explorer 3 that attempted to spam my blog identified itself as follows:

Mozilla/2.0 (compatible; MSIE 3.01; Windows 98)

All browsers tend to identify themselves with a Mozilla prefix, even all versions of Internet Explorer, so using a similar method to the one we used to block libwww is not an option, since it would deny access to almost everyone. But you can see from the string that we have a few ways to block this bot and others like it, the first being the Mozilla version.

We are going to block this useragent using Mozilla/2.0 as the basis of our regular expression. So edit your badbrowsers.conf once more and on a new line add the following:

^Mozilla/2\.

Save and reload Squid's configuration, and the next time this bot visits it will be greeted with a 403 denied response.
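
You can sanity-check a pattern before reloading Squid by matching it against the offending useragent string with grep's extended regular expressions, which are close enough to Squid's POSIX regex matching for a quick test:

echo "Mozilla/2.0 (compatible; MSIE 3.01; Windows 98)" | grep -E "^Mozilla/2\."

If grep prints the string back, the pattern matches; no output means no match.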

Another regular expression that achieves the same result is one based on the MSIE 3.01 identifier. To do this you would add the following regular expression to your badbrowsers.conf:

(.*?)MSIE 3

From these two examples we now have our basis for blocking traffic from browsers that no one in their right mind should be using. A slight variation is needed for blocking older versions of Firefox, however, because all versions of Firefox identify themselves as Firefox/x, where x is the version number. So if you would like to block visitors claiming to be using Firefox 0.9.6, or any other 0.x release, you can use the following regular expression:

(.*?)Firefox/0
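
The same grep trick shows that this pattern catches the old releases without touching current ones. Both useragent strings below are only illustrative, and for simplicity the test uses just the Firefox/0 part of the pattern:

echo "Mozilla/5.0 (Windows; U; rv:1.7) Gecko/20040803 Firefox/0.9.6" | grep -E "Firefox/0"     # matches, would be blocked
echo "Mozilla/5.0 (Windows; U; rv:1.9) Gecko/2008070208 Firefox/3.0.1" | grep -E "Firefox/0"   # no output, not blocked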

What Next?

Well that depends on what bots and useragents you would like to block. There are many sites out on the net with comprehensive lists of bad bots and browsers, but in reality you are only ever likely to see a small fraction of them. However, there is nothing stopping you from adding them all to your badbrowsers.conf. You can use the latest version of mine as a starting point. The example below has been in use for almost a year at the time of writing and I have only needed to edit it a handful of times since.

Extracting Offending IP addresses from logs

Using the methods above returns a 403 error to the offending client. If, like me, you would also like to ban those IP addresses, then you need a way to extract them from Squid's logs.

You can do this by entering the following at the command line:

tail -n 5000 access.log | grep "403" | awk '{print $1}' | sort | uniq -d > file.txt

Your paths may vary according to your operating system. This command scans the last 5000 lines of the access log for entries containing a 403 error. Using awk it extracts the first field of each matching entry, which in my setup is the client's IP address. Finally it sorts the addresses (uniq only spots duplicates on adjacent lines) and writes each one that appears more than once to file.txt.

Important! This method of extraction depends on having Squid configured to log to a custom log format and it will not work on the default Squid logs. The custom format that I use is documented on my Reverse Proxy with Squid page.

You can now use the extracted IP addresses to block them on your firewall.
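
As an illustration, a minimal way to feed file.txt into a Linux firewall is to loop over it with iptables. This is only a sketch: the choice of the INPUT chain, the lack of any whitelisting and the fact that the rules will not survive a reboot are all things to adapt to your own setup.

# drop all further traffic from each extracted address (sketch only)
while read ip; do
    iptables -A INPUT -s "$ip" -j DROP
done < file.txt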

My List of Bots to block

Here are the contents of my badbrowsers.conf:

Netscape
(.*?)MSIE 5
(.*?)MSIE 4
(.*?)MSIE 3
(.*?)Firefox/0
(.*?)Firefox/1\.0
^Speedy
^Nutch
^Java
^shelob
^ISC
^TrackBack
^ichiro
^bot
^HouxouCrawler
^User
^user
^Attentio
^Mp3Bot
^NameOfAgent
^yacybot
^WWW-Mechanize
^MLBot
^WebAlta
^MJ12bot
^Python
^libwww
^Libwww
^trackback
^heretrix
^jakarta
^Jakarta
^LibWWW
^Mozilla/1\.
^Larbin
^larbin
^Gator
^gator
^Mozilla/3\.
^Mozilla/4\.
^Opera/7\.
^Opera/6\.
^Opera/5\.
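
After editing the list it is worth checking that Squid still accepts its configuration before reloading. On most installations the squid binary can run this check itself, and the same reload command from earlier then picks up the changes:

# parse squid.conf (and the acl files it references) and report any errors
squid -k parse
# then tell the running Squid to re-read its configuration
/etc/init.d/squid reload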
