Web ADV: Die!
Like everybody (I think), I have my list of favourite web sites that I read almost every day. Most of them are news and/or technical info, and most of them support themselves with advertising.
I've got nothing against the fact that those people are trying to make a living out of their business; what I can't stand is the amount of "intrusive" ads appearing on their pages. By 'intrusive' I mean the kind of ad that can't just sit on the page, no, it has to jump up and down in front of it yelling 'LOOK AT ME!!' so hard that my glasses crack every time.
I hate this!
Like everyone (or almost everyone) I use some 'tricks' to avoid the most annoying ads: redirecting the ad-riddled sites to 127.0.0.1, firewall rules to block the ad-serving sites and so on. But most of those tricks only work if you always use the same PC, or if you remember to copy and update them from one machine to another.
Today, since I was at home and the weather sucked, I decided to tackle the problem at the root and set up a transparent proxy with filtering.
A proxy server like Squid comes with a vast array of features that can be used to filter and disallow certain things. Specifically, it can deny access to one or more sites based on part of the URL, the domain, the IP address and so on. If the site can't be reached, it can't serve the ad.
There are various ways to use a proxy. The simplest is to install and configure it and then tell your browser to use it; every browser has its own way to do that. This is fine if you don't have to move the PC (or laptop) in and out of different networks; otherwise it's a pain.
Another way is to provide 'autoconfiguration' scripts, which can be distributed in various ways.
The last (and best) way is to use a 'transparent' proxy: the proxy sits between you and the 'normal' web, but you don't see it. You just use it without knowing. This avoids the burden of configuring every machine to use the proxy and of maintaining that configuration, but it requires a little firewall configuration to 'hijack' the connections and direct them to the proxy.
In practice, first you need to configure the proxy to accept standard HTTP requests (not proxy requests) and to treat them as proxy requests; then you need to use iptables (or ipfilter) to redirect all the outgoing HTTP connections to the proxy.
Squid uses ACLs (Access Control Lists) to filter connections. ACLs are rules that let you specify URLs, ports, domains and the like, and then what is allowed and what is not (denied). Squid's documentation contains all the details; for my purposes only two kinds of ACL are needed: one to filter on target domains and one to filter on URLs, using a regexp to match part of the URL.
This allows me to block access to all the ad-related domains and sites.
Let's see how to use ACLs:
First you need to define an ACL and give it a name. To do so you need a line like this in the configuration file:
acl aclname acltype parameters
Where 'aclname' is (of course) the name you want to assign, 'acltype' is the type of the ACL (what is going to be checked) and 'parameters' are what to check for.
After an ACL has been defined, it can be used to block or allow access with a line like this:
http_access [deny|allow] aclname
This denies or allows the request when it matches the ACL. Squid goes through the http_access lines in order and applies the first one that matches.
A full example:
acl deathtoad0 url_regex .*ads.*
acl deathtoad1 url_regex .*pagead.*
acl deathtoad2 url_regex .*doubleclick.*
http_access deny deathtoad0
http_access deny deathtoad1
http_access deny deathtoad2
This refuses access to every URL containing 'ads', 'pagead' or 'doubleclick' anywhere in the URL.
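Since the patterns match anywhere in the URL, they can catch more than intended. The following is only a rough sketch of the check, not Squid itself: it assumes that grep -E behaves closely enough to Squid's url_regex (an unanchored, case-sensitive regular expression), and both the would_block function and the URLs are made up for the illustration.

```shell
#!/bin/sh
# Rough model of the url_regex check: the pattern is searched anywhere
# in the URL. grep -E stands in for Squid's matcher (an assumption).
would_block() {
    # $1 = URL to test; the patterns are the three from the example ACLs
    printf '%s\n' "$1" | grep -Eq 'ads|pagead|doubleclick'
}

# Hypothetical URLs, chosen only to show what matches and what doesn't
for url in \
    'http://ad.doubleclick.net/banner.gif' \
    'http://example.org/downloads/file.tar.gz' \
    'http://example.org/index.html'
do
    if would_block "$url"; then
        echo "DENY  $url"
    else
        echo "ALLOW $url"
    fi
done
```

The second URL is denied only because 'downloads' happens to contain 'ads', which is the kind of thing worth checking before blaming the proxy for a broken site.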
As I said before, to make a proxy 'transparent' you first need to redirect all the outgoing HTTP connections to the proxy and its port. Then you need to tell the proxy to handle plain HTTP requests like proxy requests.
For the first step, a firewall rule like the following is required:
iptables -t nat -A PREROUTING -i $laninterface -p tcp --dport 80 \
    -j DNAT --to-destination ip.of.proxy.server:$proxyport
Where 'laninterface' is the network interface on the LAN side, 'ip.of.proxy.server' is the IP of the proxy server and 'proxyport' is the port it listens on (of course).
This simple rule hijacks all the outgoing HTTP connections and redirects them to the proxy. Now it's time to configure the proxy.
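Here is the same rule with the blanks filled in, as a sketch you can try without touching the firewall. The interface name, the address and the port are placeholders I made up, and the run helper is hypothetical: by default it only prints the commands. The extra ACCEPT rule is not part of the rule above; it is an addition that assumes the proxy sits on the same LAN, so that its own requests to port 80 are not redirected back to itself. Other network layouts may need more rules than these two.

```shell
#!/bin/sh
# Placeholder values for this sketch; replace them with your own.
laninterface=eth1          # interface facing the LAN (assumed name)
proxyip=192.168.1.10       # address of the Squid box (assumed)
proxyport=3128             # port Squid listens on (3128 is its default)

# Print the commands by default; set DRY_RUN=0 and run as root to apply them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

install_redirect() {
    # Let the proxy's own requests through untouched, otherwise they
    # would be sent back to the proxy in a loop.
    run iptables -t nat -A PREROUTING -i "$laninterface" -s "$proxyip" \
        -p tcp --dport 80 -j ACCEPT
    # Everything else headed for port 80 goes to the proxy.
    run iptables -t nat -A PREROUTING -i "$laninterface" -p tcp --dport 80 \
        -j DNAT --to-destination "$proxyip:$proxyport"
}

install_redirect
```

The order matters: the ACCEPT rule has to be added before the DNAT rule, because the first matching rule in the chain wins.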
There are a number of documents on the subject, and I suggest you read them. But the core is that you need to add or change a few parameters:
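httpd_accel_host virtual
httpd_accel_port 80
httpd_accel_with_proxy on
httpd_accel_uses_host_header on
This allows Squid to handle every plain HTTP request as a proxy request, and so to act as a proxy even when it isn't addressed as one. These directives belong to the Squid 2.5 series; later releases replaced them with an option on the http_port line, so check the documentation of the version you run.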
After having installed and configured Squid as a "normal" proxy (forget the 'transparent' bit for the moment), try it out to see if it works and if the ACLs you defined are correct. You may need to tweak them to block whatever sites you want blocked and to un-block what you don't want blocked. When everything is fine, you can add the transparent bits and the firewall rule.
Remember that, if something isn't right, you can always remove the firewall rule and everything will work like before, ignoring the proxy.
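A minimal sketch of that escape hatch, assuming the rule was added exactly as in the iptables line above: deleting it is the same command with -D in place of -A. The values are placeholders of mine and have to match the ones used when the rule was added; the script only prints the command, so nothing changes until you run the printed line as root.

```shell
#!/bin/sh
# Placeholder values; they have to match the rule that was added.
laninterface=eth1
proxyip=192.168.1.10
proxyport=3128

# Same specification as the original rule, with -D (delete) instead of -A.
delete_cmd="iptables -t nat -D PREROUTING -i $laninterface -p tcp --dport 80 -j DNAT --to-destination $proxyip:$proxyport"

# Only printed here; run the printed line as root to really remove the rule.
echo "$delete_cmd"
```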
One of the pluses of using a proxy is that you can get statistics about the most visited web sites, and use them to tweak and trim the cache for better performance.
To obtain statistics, you first need to activate logging. This can be done with the following parameters in the proxy's configuration file:
cache_access_log /where/do/you/want/it/squid.log
emulate_httpd_log on
The 'emulate' parameter makes Squid write the log like a web server log, so you can process it with any web log analyzer. I prefer webalizer, but others can be used too.
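This is an example of a webalizer config file to process a Squid log. LogType has to match the format Squid actually writes: 'squid' means the native format, while the httpd-style log produced with emulate_httpd_log on is read with 'clf'.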
LogFile /where/is/squid.log
LogType squid
OutputDir /where/do/you/want/output
HistoryName squid
Incremental yes
IncrementalName squid.current
ReportTitle proxy statistics
HostName yourhostname
HTMLExtension html
PageType htm*
PageType pl
PageType cgi
Quiet yes
ReallyQuiet yes
AllSites yes
AllReferrers yes
AllURLs yes
AllUsers yes
This generates statistics with the most visited sites and who looked at them. Watch out for privacy laws and the like in your area if you use this on a company- or office-wide scale.
Using a transparent proxy to filter the web is useful, but watch out for the load you put on the proxy itself (it must be able to handle all the traffic, disk-cache-wise), and if you start logging and stat'ing the accesses, watch out for privacy-related problems.
Remember to add some log rotation to your configuration if you activate logging.
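One way to do the rotation is to let Squid handle it: 'squid -k rotate' tells the running proxy to close, rename and reopen its logs, and the logfile_rotate directive in squid.conf sets how many old copies are kept. What follows is a sketch of a small script that cron could run once a day. The function name and the SQUID variable are my own, and it assumes the squid binary is in the PATH.

```shell
#!/bin/sh
# Sketch of a daily log rotation script for cron.
# Assumptions: squid is in the PATH, and squid.conf has a line such as
# "logfile_rotate 7" so that seven old logs are kept.
SQUID=${SQUID:-squid}

rotate_squid_logs() {
    # -k rotate asks the running Squid to rotate its log files
    "$SQUID" -k rotate
}

# Rotate only when called with "run", so the file can be sourced safely.
if [ "${1:-}" = "run" ]; then
    rotate_squid_logs
fi
```

Called from cron as 'thisscript run', it rotates the logs once per invocation. If webalizer reads the same log, make it run just before the rotation, so no entries are lost between the two.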
Davide Bianchi works as a Unix/Linux administrator for a hosting provider in The Netherlands.