As our business has grown over the past few years, we have seen the expected and welcome result of increased traffic to our sites. The downside to all this activity is the extra pressure it puts on our application, in terms of both increased load on the database and increased bandwidth. This is a good problem for a web development company to have, as long as the traffic is desirable. By desirable I mean either real users browsing our websites, or welcome spiders such as the major search engines.
Recently though we undertook a project to try to identify how many requests we were receiving from unwelcome sources, such as email harvesters, comment spammers and the like. If we could reduce the frequency of these requests, we would potentially see many benefits, including
So how did we go about this? Read on…
Project Honeypot is a great online resource that logs IP addresses that have been involved in suspicious practices. An IP address on its register is categorised as Suspicious, Harvester, Comment Spammer, Search Engine, or Unknown, or possibly a combination of these.
The IP address register is a fluid list that all of the members of the Project Honeypot community contribute to. When a website development company signs up as a member of the community, it has the option of adding a honeytrap page to its websites. This is a page that looks innocuous enough, but is hidden from users. Bots, though, can crawl every page on your site (particularly those bots that are not thoughtful enough to heed your robots.txt file), and this includes the honeytrap. It’s simple enough to conclude that any IP address that records a visit to this page is at the very least worthy of suspicion.
There are a number of ways for a web developer to link to a honeytrap page so that the link is hidden from real users but still attracts the attention of bots. These include
We’ve found that including a combination of these in a function that is called on every page attracts a lot of bots to our honeytrap, resulting in more rogue IP addresses logged on the Project Honeypot register. With a large online community of sites hosting honeypot pages provided by Project Honeypot, the web is seeded with countless trap pages recording the visits of potentially nefarious bots and spiders.
Even if a web development company does not want to contribute to capturing IP addresses, it can still use the register as a free resource against which to compare IP addresses. Simply requesting http://www.projecthoneypot.org/ip_#IPAddress# will return information on that IP address, including the country of origin, user agent strings associated with its use, number of sightings, associated IP addresses, and a history of malicious use. Manually calling this page is all very well for the odd IP address that you need to check, but Project Honeypot also provides a service that allows you to make the same check programmatically. This is how we use the register.

When a request comes to one of our sites, we identify the IP address. Normally this can be found in the Remote_Addr value of the CGI scope. On our sites, though, requests are handled by a load balancer and arrive with that device’s IP address, so we read the visitor’s address from the X-Cluster-Client-IP header instead.
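The lookup order described above can be sketched in a few lines. This is only an illustration in Python, assuming a WSGI-style `environ` dictionary in which the X-Cluster-Client-IP header appears as `HTTP_X_CLUSTER_CLIENT_IP`; the function name `client_ip` is hypothetical and not part of our application.

```python
def client_ip(environ):
    """Return the visitor's IP address from a WSGI-style environ dict.

    Hypothetical helper: behind a load balancer, REMOTE_ADDR holds the
    balancer's own address, so the forwarded header is preferred.
    """
    forwarded = environ.get("HTTP_X_CLUSTER_CLIENT_IP", "").strip()
    if forwarded:
        return forwarded
    # No forwarded header: fall back to the address of the direct peer.
    return environ.get("REMOTE_ADDR", "")
```

A header like this can be set by anyone who connects directly to the web server, so it should only be trusted when the request is known to have come through the load balancer.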
We have a database table of all IP addresses that have visited our sites, along with their status, which can be unchecked, approved, blacklisted or whitelisted. If an incoming IP address has never been seen before on our sites, we simply add a new row to the table for that IP address with an unchecked status. The idea is that all unchecked IP addresses are sent to Project Honeypot to see if they are malicious, but we don’t do this in real time, because of the performance hit on the request. Rather, we run a scheduled task every five minutes that loops through all unchecked IP addresses and sends them over for review.
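As a rough sketch of this bookkeeping, the following Python example uses an in-memory SQLite database. The table name `ip_register`, the column names, and the `is_unsafe` callback are all assumptions made for illustration; our real schema and scheduler are not shown here.

```python
import sqlite3

def open_register(path=":memory:"):
    """Create (if needed) a table holding one row per visiting IP address."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ip_register ("
        " ip TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'unchecked'"
        " CHECK (status IN"
        " ('unchecked', 'approved', 'blacklisted', 'whitelisted')))"
    )
    return conn

def record_visit(conn, ip):
    """Called on each request: add the address only if it is new."""
    # INSERT OR IGNORE leaves an existing row, and its status, untouched.
    conn.execute("INSERT OR IGNORE INTO ip_register (ip) VALUES (?)", (ip,))
    conn.commit()

def review_unchecked(conn, is_unsafe):
    """Body of the scheduled task: classify every unchecked address.

    is_unsafe is a caller-supplied function (hypothetical here) that asks
    the external register about one address and returns True or False.
    """
    rows = conn.execute(
        "SELECT ip FROM ip_register WHERE status = 'unchecked'"
    ).fetchall()
    for (ip,) in rows:
        status = "blacklisted" if is_unsafe(ip) else "approved"
        conn.execute(
            "UPDATE ip_register SET status = ? WHERE ip = ?", (status, ip)
        )
    conn.commit()
    return len(rows)  # number of addresses reviewed in this run
```

Keeping `record_visit` to a single cheap insert is what keeps the request path fast; the slower external lookups happen only inside `review_unchecked`, which a scheduler such as cron could run every five minutes.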
Each IP address that is sent to Project Honeypot is either on their list of suspicious IP addresses or not. Those that are viewed as safe, we mark as approved. Those that are viewed as unsafe, we mark as blacklisted. As of the time of writing, we have sent 350,000 IP addresses to Project Honeypot for review. Of these, 6,500 have been identified as unsafe and marked accordingly.
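For readers who want to try this, here is a minimal Python sketch of a single check. It assumes the programmatic service is Project Honeypot's DNS-based http:BL lookup, in which an access key and the reversed octets of the address are queried under `dnsbl.httpbl.org` and a listed address comes back as `127.<days>.<threat>.<type>`. The function names are hypothetical, and treating search engines as approved is our assumption for the example, not a rule of the service.

```python
import socket

HTTPBL_ZONE = "dnsbl.httpbl.org"

def httpbl_query_name(access_key, ip):
    """Build the DNS name to look up: <key>.<reversed octets>.<zone>."""
    reversed_octets = ".".join(reversed(ip.split(".")))
    return f"{access_key}.{reversed_octets}.{HTTPBL_ZONE}"

def classify_answer(answer):
    """Map a lookup answer (None means 'not listed') to a local status."""
    if answer is None:
        return "approved"
    first, _days, _threat, visitor_type = (int(p) for p in answer.split("."))
    if first != 127:
        raise ValueError(f"unexpected answer: {answer}")
    # Type is a bit set: 1 = suspicious, 2 = harvester, 4 = comment spammer.
    # Type 0 means a known search engine, which this sketch lets through.
    return "blacklisted" if visitor_type & 7 else "approved"

def check_ip(access_key, ip):
    """Look up one IPv4 address; needs network access and a real key."""
    try:
        answer = socket.gethostbyname(httpbl_query_name(access_key, ip))
    except socket.gaierror:
        # No record found. A DNS outage looks the same here, so a real
        # job may prefer to leave the address unchecked and retry later.
        answer = None
    return classify_answer(answer)
```

The days and threat score fields are ignored above; a stricter or more lenient policy could use them, for example by only blacklisting addresses seen recently.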
Next time round, I’ll be posting the second half of this article, explaining what it means for an IP address to be blacklisted on our sites, how we handle false positives, and other approaches we take to identifying bots.