DansGuardian: A Content Filtering System

From linsec.ca

One feature many firewall appliances have been pushing recently is content filtering proxies, whether transparent or authenticated. These content filtering proxies are a boon to individuals with young children in the house, but many of them are extremely basic. Without pointing any fingers, I had purchased a content filtering firewall appliance that promised content filtering and was sorely disappointed. The content filtering was extremely basic and was solely word-based. Unfortunately, this word list was something the end user had to enter in by hand. So if you're looking to keep your children from stumbling across some pornographic web pages, you have to get pretty creative to populate your word list. The other drawback to this particular appliance was that port forwarding didn't work. Eventually it was this drawback that convinced me to sell it; my daughter is only a year and a half and won't be surfing solo anytime soon.

As a result, I took an unused desktop computer and installed MandrakeSoft's Multi-Network Firewall 8.2 on it, to replace the now removed firewall appliance. While the end result may have been more expensive, you can't put a price on flexibility. And by using a Linux-based firewall operating system, I get all the flexibility I want.

One way you can configure MNF is to use Squid as a transparent proxy, which is ideal. No reconfiguration of the LAN to point everything to a proxy. Cached web pages. And the ability to use DansGuardian as a content filter. MNF also comes with squidGuard which is nice, but doesn't seem to be as flexible as DansGuardian when it comes to content filtering.

DansGuardian has a few requirements. It requires Squid for the web proxy, and it requires a web server like Apache. It does not require MNF, and while this was written to use DansGuardian on MNF, it will run on a variety of operating systems including any Linux, FreeBSD, OpenBSD, and even OS X (although at the time of this writing, the OS X support is alpha quality).

Configuring DansGuardian

Building DansGuardian is very straightforward. For the purpose of this tutorial, we will assume DansGuardian is installed in the system so the binary is in /usr/sbin and the configuration files are in /etc/dansguardian. DansGuardian comes with a logrotation script that is installed into /etc/dansguardian, called logrotation. This should be executed weekly, so you should add the following to your crontab (as root):

59 23 * * sat /etc/dansguardian/logrotation

You can edit root's crontab by executing crontab -e as root. Another alternative, for systems that use logrotate, is to create a file called dansguardian in your /etc/logrotate.d directory that looks like this:

/var/log/dansguardian/access.log {
    rotate 5
}

If you do choose to use the script that comes with DansGuardian, make sure you chmod 0700 the script to make it executable.

To start DansGuardian you can use the SysV-style initscript (e.g. Mandrakelinux packages come with /etc/rc.d/init.d/dansguardian, which can be started by using service dansguardian start), or you can just execute dansguardian on the command line. If you start DansGuardian in this way, you can use the typical "kill" method of stopping it, or use dansguardian -q.

The main configuration file for DansGuardian is /etc/dansguardian/dansguardian.conf. A number of other files are included as well: the banned lists and the exception lists. These files all reside in the /etc/dansguardian directory. Every time you make a change to any of these files, you will need to restart DansGuardian, which can be accomplished by executing dansguardian -r as root.

The following files make up the overall configuration of DansGuardian:

File: Description
exceptioniplist: This file contains a list of client IP addresses that you wish to allow unrestricted access (no filtering).
exceptionphraselist: This file contains a list of phrases that, if they appear in a web page, will bypass filtering. You may want to use the weightedphraselist instead, as this can result in a lot of pages not being blocked.
exceptionsitelist: This file contains a list of domain endings that, if found in the requested URL, will not be filtered.
exceptionurllist: This file contains a list of URL parts for sites where filtering should be turned off.
exceptionuserlist: This file contains a list of usernames that will not be filtered (you must use basic authentication or ident must be enabled for this to work).
bannedextensionlist: This file contains a list of file extensions that will be banned. This can be used to restrict users from downloading screen savers, executable files, viruses, and so forth.
bannediplist: This file contains a list of client IP addresses that will not get web access at all.
bannedmimetypelist: This file contains a list of MIME-types that will be banned. If a URL request returns a MIME-type in this list, DansGuardian will block it. This can be used to block movies, but shouldn't be used to block graphic image files or text/html, etc.
bannedphraselist: This file contains a list of phrases that will result in banning a page. Each phrase must be enclosed between < and > characters, and may contain spaces. You can also use a combination of phrases that, if all are found in a page, will result in it being blocked.
bannedregexpurllist: This file contains a list of regular expressions; URLs that match them will be banned. This can be used to restrict users from downloading screen savers, executable files, viruses, and so forth.
bannedsitelist: This file contains a list of sites that are to be banned. You can use IP addresses here as well as domain names, and can even include stock SquidGuard blacklists.
bannedurllist: This file contains a list of URL parts to block, which allows you to block parts of a site rather than the entire site. You can use SquidGuard lists here as well.
banneduserlist: This file contains a list of usernames to whom, if basic proxy authentication is enabled, access will be denied automatically.
weightedphraselist: This file contains a list of phrases with a corresponding positive or negative value. As phrases are encountered in a page, the total "value" of the page will be calculated based on these values; good phrases will have negative values and bad phrases will have positive values. Once the Naughtiness Limit has been reached (defined in dansguardian.conf), the page will be blocked.
pics: This file contains a number of PICS sections that allow you to fine-tune your PICS filtering. The defaults for DansGuardian are for young children (mild profanity, artistic nudity, etc.).

Each of these configuration files is very straightforward and basically holds one item per line (e.g. a URL or an IP address).
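To make the one-item-per-line idea concrete, here is a minimal Python sketch of how a list such as exceptionsitelist could be read and checked against a requested host. It is not DansGuardian's own code: the function names, the sample entries, and the assumption that '#' starts a comment are all made up for illustration.

```python
def load_list(text):
    """Return the non-empty, non-comment lines of a list file's contents."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if line:
            entries.append(line.lower())
    return entries


def host_matches(host, endings):
    """True if host equals an entry or is a subdomain of one."""
    host = host.lower().rstrip(".")
    return any(host == e or host.endswith("." + e) for e in endings)


# Hypothetical list contents, one domain ending per line.
sample = """
# sites that should never be filtered
example.org
kids.example.com
"""
sites = load_list(sample)
print(host_matches("www.example.org", sites))  # True
print(host_matches("badexample.org", sites))   # False
```

Matching on a leading dot keeps badexample.org from being treated as part of example.org; whether DansGuardian draws the boundary the same way is not something this sketch claims.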

The dansguardian.conf file is the primary configuration file for DansGuardian. It is here that you will configure things like logging, where to redirect users when attempting to access a denied page, and so forth. The file is heavily commented and fairly straightforward.

An example dansguardian.conf file without comments follows:

reportinglevel = 2
htmltemplate = '/etc/dansguardian/template.html'
accessdeniedaddress =

loglevel = 3

filterip =
filterport = 3328
proxyport = 3128
proxyip =

bannedphraselist = '/etc/dansguardian/bannedphraselist'
exceptionphraselist = '/etc/dansguardian/exceptionphraselist'
weightedphraselist = '/etc/dansguardian/weightedphraselist'
bannedsitelist = '/etc/dansguardian/bannedsitelist'
exceptionsitelist = '/etc/dansguardian/exceptionsitelist'
exceptionurllist = '/etc/dansguardian/exceptionurllist'
bannedurllist = '/etc/dansguardian/bannedurllist'
bannedregexpurllist = '/etc/dansguardian/bannedregexpurllist'
bannedextensionlist = '/etc/dansguardian/bannedextensionlist'
bannedmimetypelist = '/etc/dansguardian/bannedmimetypelist'
bannediplist = '/etc/dansguardian/bannediplist'
exceptioniplist = '/etc/dansguardian/exceptioniplist'
banneduserlist = '/etc/dansguardian/banneduserlist'
exceptionuserlist = '/etc/dansguardian/exceptionuserlist'
picsfile = '/etc/dansguardian/pics'

weightedphrasemode = 1
naughtynesslimit = 50
showweightedfound = on

reverseaddresslookups = off
createlistcachefiles = on
maxuploadsize = -1

username_id_method_proxyauth = off
username_id_method_ntlm = off # **NOT IMPLEMENTED**
username_id_method_ident = on

forwarded_for = off
maxchildren = 120
log_connection_handling_errors = on

This is a fairly standard configuration; you can even use it verbatim, provided you change the IP addresses and port settings to match your own system.
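If you want to inspect such a file from a script, the keyword = value layout is easy to read. The following Python sketch is only an illustration of the format shown above; parse_conf is a hypothetical helper, and it assumes that '#' always starts a comment and that values are optionally wrapped in single quotes.

```python
def parse_conf(text):
    """Parse 'key = value' lines into a dict, dropping comments and quotes."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip("'")
    return settings


# A few lines taken from the example configuration.
sample = """
reportinglevel = 2
htmltemplate = '/etc/dansguardian/template.html'
filterport = 3328
proxyport = 3128
username_id_method_ntlm = off # **NOT IMPLEMENTED**
"""
conf = parse_conf(sample)
print(conf["filterport"], conf["proxyport"])  # 3328 3128
print(conf["htmltemplate"])                   # /etc/dansguardian/template.html
```

All values come back as strings, so a caller would convert ports and limits with int() as needed.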

The reportinglevel setting of 2 tells DansGuardian to fully report why access was denied (i.e. give the denied phrase). You may choose to use a level of 1 instead, or 3 to use the HTML template file. If you use the HTML template file, then the htmltemplate keyword needs to be set to the full path and filename of the template file you wish to use. If you use a setting of 0 through 2, you will need to set the accessdeniedaddress keyword. In this case, it points to the internal IP address of our firewall and the port it listens on (in this case port 8444). It also contains the full path to the dansguardian.pl CGI script.

The loglevel keyword determines what gets logged to the /var/log/dansguardian/access.log logfile.

The filterip keyword determines which IP address DansGuardian will listen on. If left blank, DansGuardian listens on all IPs. The filterport keyword is the port that DansGuardian will bind to. The proxyip keyword is the IP address of the proxy, usually the localhost. The proxyport keyword is the port to use to connect to the proxy (in this case 3128, which is the port that Squid is listening on).

The keywords that follow each point to one of the list files discussed earlier and simply make that file part of the configuration.

The weightedphrasemode keyword determines how weighted phrases are used. A setting of 1 is for normal operation. The naughtynesslimit keyword sets the limit over which a page will be blocked. This is based on the values in the weightedphraselist file, and each "hit" on a page will modify the naughtiness of the page. The higher the rating, the "naughtier" the page. As a general rule of thumb, with the default settings, a limit of 50 is suitable for young children, 100 for older children, and 160 for young adults.

The showweightedfound keyword determines whether the phrases that pushed the total over the naughtiness limit will be logged and, if reportinglevel is set to 2, reported.

The reverseaddresslookups keyword determines whether or not DansGuardian will look up the DNS name for a URL that uses an IP address and search for both the IP and the name in the banned site and URL lists. This is useful for preventing a user from simply entering the IP address of a banned site. It can also slow down searching, however.
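The idea behind the lookup can be sketched in a few lines of Python. This is an illustration only, not DansGuardian's implementation: is_banned and the resolver callback are hypothetical, and a stand-in dictionary replaces real DNS so the example runs without network access (on a live system, something like socket.gethostbyaddr(ip)[0] could serve as the resolver).

```python
import ipaddress


def is_ip(host):
    """True if host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def is_banned(host, banned_sites, resolve_name, reverse_lookups=True):
    """Check host and, for literal IPs, also the name it resolves to."""
    candidates = [host]
    if reverse_lookups and is_ip(host):
        name = resolve_name(host)  # hypothetical resolver callback
        if name:
            candidates.append(name)
    return any(c in banned_sites for c in candidates)


# Stand-in for DNS: a documentation-range IP mapped to a made-up name.
fake_dns = {"192.0.2.10": "banned.example"}
banned = {"banned.example"}

print(is_banned("192.0.2.10", banned, fake_dns.get))                         # True
print(is_banned("192.0.2.10", banned, fake_dns.get, reverse_lookups=False))  # False
```

The second call shows the bypass that the keyword is meant to close: with lookups off, the bare IP address is not recognized as the banned site.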

The createlistcachefiles keyword determines whether or not the bannedsitelist and bannedurllist files will be cached. Fast computers do not need this, but on slower computers it can make the process start noticeably faster.

The maxuploadsize keyword is used for POST protection on web upload forms. A setting of -1 disables, a setting of 0 blocks completely, and any other value sets the file upload size in kilobytes (after MIME encoding and headers).
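The three cases can be written out as a small decision function. This Python sketch only restates the rule above; upload_allowed is a hypothetical name, and treating a kilobyte as 1024 bytes with an inclusive limit is an assumption.

```python
def upload_allowed(post_size_bytes, maxuploadsize):
    """Apply the maxuploadsize rule to the size of a POST body."""
    if maxuploadsize == -1:
        return True   # POST protection disabled
    if maxuploadsize == 0:
        return False  # uploads blocked completely
    # Any other value is a limit in kilobytes (assumed 1024 bytes each).
    return post_size_bytes <= maxuploadsize * 1024


print(upload_allowed(5_000_000, -1))  # True
print(upload_allowed(10, 0))          # False
print(upload_allowed(2048, 2))        # True
print(upload_allowed(2049, 2))        # False
```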

The forwarded_for keyword, if enabled, will add an X-Forwarded-For header to the HTTP request. This may be required for some sites that need to know the source IP.

The maxchildren keyword sets the maximum number of processes to spawn to handle incoming connections. This can be used to prevent DoS attacks from killing the server by maxing out spawned processes.

The log_connection_handling_errors keyword is used to determine if DansGuardian will log debug info to syslog.

Configuring Squid

No real special configuration needs to be done with Squid, although the use of Squid is required. DansGuardian will appear to Squid to be a normal web browser. However, the system must be configured in such a way that users cannot bypass DansGuardian, just as they should not be able to bypass Squid. You can do this either with an authenticated proxy (i.e. users must log into the proxy and provide valid credentials to be able to access the network), or with a transparent proxy, one where outbound web traffic is routed through the proxy without the end user even knowing it's there; it's all done at the firewall level.

For instance, on a system with just Squid doing web proxying, you might have the firewall redirect all requests for outbound port 80 (HTTP) to localhost port 3328 (the proxy). With this method, the end user does not need to reconfigure their browser, and isn't even aware of the proxy at all... until it is unable to connect to a website on their behalf, or until DansGuardian blocks a site.

Taking this further, there is a certain process that needs to be followed. What you want to accomplish is the following:

lan -> fw -> fw -> wan

Or, looking at it a different way:

client -> DansGuardian -> Squid -> server

On the Multi Network Firewall system I use, Squid is configured to listen on port 3328, the port that DansGuardian listens on by default. Once /etc/squid/squid.conf has been changed so that Squid listens on port 3128 instead, DansGuardian can be activated. DansGuardian connects to Squid on the port set by proxyport (3128 here), so you must have Squid listening there. It's a simple matter of modifying one file and restarting Squid, then starting DansGuardian. The firewall rules for the transparent proxy don't even need to be changed, as the firewall will already be forwarding HTTP requests to port 3328 on the localhost.
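Because three separate pieces (the firewall redirect, DansGuardian and Squid) have to agree on two port numbers, a quick consistency check can help when something does not connect. The Python sketch below is a hypothetical helper, not part of any of these tools; the parameter names are assumptions, and the port numbers are simply the ones used in this article.

```python
def check_chain(redirect_port, filterport, proxyport, squid_port):
    """Return a list of mismatches in client -> DansGuardian -> Squid."""
    problems = []
    if redirect_port != filterport:
        problems.append("firewall redirects to a port DansGuardian is not on")
    if proxyport != squid_port:
        problems.append("DansGuardian's proxyport does not match Squid's port")
    if filterport == squid_port:
        problems.append("DansGuardian and Squid are set to the same port")
    return problems


# Before the change: Squid still sits on 3328, where DansGuardian wants to be.
print(check_chain(3328, 3328, 3128, 3328))
# After moving Squid to 3128 the chain is consistent and the list is empty.
print(check_chain(3328, 3328, 3128, 3128))  # []
```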

A Word On Apache

The Apache configuration is extremely straightforward and not really worth mentioning; however, for the sake of completeness, here it is. Apache must be running on the firewall, and your firewall rules should ensure that the outside world cannot access it. You could also configure Apache to listen on some non-standard port, perhaps port 8000 or some other unused port. The installation of DansGuardian should have placed the file dansguardian.pl into your web server's cgi-bin directory. This is all that Apache really needs to serve, so you can lock down Apache to only serve this one particular CGI.

When DansGuardian blocks a page, it will redirect the client to the specified server to view the reasons why the page was blocked, or simply that access was denied (depending upon how you configured it). The Apache server should listen only on the internal interface (LAN-side). You can do this by setting the BindAddress keyword in your Apache configuration files.

PICS Filtering

The pics file contains information for PICS filtering. The first keyword in the file is enablePICS and this determines whether PICS filtering will be used. If it is disabled, all other PICS-related settings are ignored.

PICS stands for Platform for Internet Content Selection. This specification allows for metadata to be associated with internet content. It was designed to help control what children access on the internet, and is used in this capacity in DansGuardian.

Building on PICS is the ICRA, or Internet Content Rating Association. This also provides a rating system for websites (one may often see "We rated with RSAC" slogans on adult websites). The ICRA used to be RSAC (Recreational Software Advisory Council). A number of keywords in the pics file deal with ICRA ratings, to allow you to tailor DansGuardian to your needs. For instance, the keyword ICRAnuditygraphic would allow (1) or disallow (0) graphic nudity. You may want to disallow that, yet allow ICRAnudityeducational. It's entirely up to you. There are also RSAC keywords, because many sites still use RSAC instead of the newer ICRA rating system. The RSAC keywords have a range of 0 (none) to 4 (wanton and gratuitous). A setting of 2 is the default.

Yet another rating system is the evaluWEB rating system, which is similar to the British Film classification system. The evaluWEBrating keyword can take the following ratings:

  • U - universal; suitable for children unattended
  • 1 - PG; Parental Guidance recommended
  • 2 - 18+; only suitable for adults

CyberNOT is another rating system, and there are two keywords that deal with it; each keyword can be a value of 0 (none) to 8 (lots). The default is 3.

SafeSurf is similar to RSAC, but contains a larger range of categories that can be set from 0 (none) to 9 (wanton and gratuitous). You can set the age of the viewer with the SafeSurferagerange keyword, from 1 (all ages) through 3 (early teens, the default) to 9 (explicitly for adults).

Weburbia is a similar rating system to evaluWEB.

Finally, Vancouver Webpages is another PICS-based rating system. There are a number of keywords dealing with it, and each one is commented because they take different values. The general rule of thumb is that a low number is good and a high number is bad.

To really fine-tune DansGuardian you should spend a little time in the pics file adjusting it for your needs. The PICS, and similar, rating systems can help with a number of sites, but they are all voluntary. In other words, the webmaster of an adult site must voluntarily place the appropriate metadata for rating on their pages; not all webmasters do this. For some it's a matter of being lazy, for others they simply don't care. The (more or less) ethical adult sites will use one or more of these rating systems to protect children from viewing content not suitable for them.

In other words, PICS is a helpful addition to your arsenal, but you will likely see better results using weighted phrases.

Weighted Phrases

The weightedphraselist file contains instructions on how to use weighted phrases, and includes a number of files dealing with various topics. The stock weighted phrase list on pornography, for instance, is one of the files included this way.

Feel free to add your own weighted phrases to this file or create your own and have it included in a similar manner.

This is used by the naughtiness limit that is configured in dansguardian.conf. Adding a positive weight increases the naughtiness rating, while adding a negative weight decreases it. For instance:

<slut><10>
< slut ><10>
<slut>,<horny><50>
<breast>,<medical><-30>
<education><-25>

The first adds 10 to the count on any occurrence of the string "slut" in a page (e.g. sluts, slut!, abslutxyz).

The second adds 10 to the count on any occurrence of the word "slut" (e.g. Sally is a slut that visits...).

The third adds 50 to the count when the strings "slut" and "horny" are found on the same page.

The fourth subtracts 30 from the count when the strings "breast" and "medical" are found on the same page.

Finally, if the string "education" is found on the page, 25 is subtracted from the count.
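To see how such weights add up against the naughtiness limit, here is a small Python sketch of the five rules just described. It is a simplification and not DansGuardian's scoring code: the function names are hypothetical, each rule is counted at most once per page (an assumption), and matching is plain case-insensitive substring search.

```python
# Each rule: (phrases that must all appear, weight added to the page total).
RULES = [
    (["slut"], 10),                # substring: also hits "sluts", "abslutxyz"
    ([" slut "], 10),              # spaces make it match the whole word only
    (["slut", "horny"], 50),       # combination: both must be on the page
    (["breast", "medical"], -30),  # negative weight lowers the total
    (["education"], -25),
]


def naughtiness(page_text, rules=RULES):
    """Sum the weights of all rules whose phrases all occur in the page."""
    text = " " + page_text.lower() + " "  # pad so edge words can match
    return sum(w for phrases, w in rules if all(p in text for p in phrases))


def is_blocked(page_text, limit=50, rules=RULES):
    """A page is blocked once its total goes over the limit."""
    return naughtiness(page_text, rules) > limit


print(naughtiness("Medical education about breast cancer"))  # -55
print(naughtiness("sluts and horny"))                        # 60
print(is_blocked("sluts and horny", limit=50))               # True
```

The negative weights are what let a medical or educational page stay under the limit even when it contains a word that would otherwise count against it.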

You can find a number of tailored files for various age ranges and situations at the DansGuardian homepage in the Extras section.