Web Development - Spank those bots - Part 1

by Garry Viner
in Building a Website, Website Advice, The Web Showroom, Web Development
28 Nov 2011
 

As our business has grown over the past few years, we have seen the expected and welcome result of increased traffic to our sites. The downside to all this activity is the extra pressure it puts on our application, both in increased load on the database and in increased bandwidth use. This is a good problem for a web development company to have, as long as the traffic is desirable traffic. By desirable I mean either real users browsing our websites or welcome spiders, such as those of the major search engines.

Recently, though, we undertook a project to identify how many requests we were receiving from unwelcome sources, such as email harvesters, comment spammers and the like. If we could reduce the frequency of these requests, we would potentially see many benefits, including:

  • A decrease in the load on the database, resulting from fewer requests per second
  • A reduction in bandwidth, speeding up performance and lowering operational costs
  • A reduction in spurious error messages. Every error triggered on our sites is handled with a user-friendly message and an email to support for attention. Many of these are legitimate errors that we need to respond to. Others can be ignored, because they were not triggered by a real user action. We wanted to know how many of the error emails we receive fall into this second category, and which ones.
  • An improvement in session management. Unlike real users, bots typically request every page without sending a cookie, so every page crawled by a bot creates a new session. Each of those sessions consumes either memory or database resources, depending on how you store session information.

So how did we go about this? Read on…

Project Honeypot is a great online resource that logs IP addresses that have been involved in suspicious practices. An IP address on its register is categorised as Suspicious, Harvester, Comment Spammer, Search Engine, or Unknown, or possibly a combination of these.

The IP address register is a fluid list to which all members of the Project Honeypot community contribute. When a website development company signs up as a member of the community, it has the option of adding a honeytrap page to its websites. This is a page that looks innocuous enough, but is hidden from users. Bots, though, can crawl every page on your site (particularly those bots that are not thoughtful enough to heed your robots.txt file), and this includes the honeytrap. It’s simple enough to conclude that any IP address that records a visit to this page is at the very least worthy of suspicion.

There are a number of ways for a web developer to link to a honeytrap page so that the link is hidden from real users but still attracts the attention of bots. These include:

  • CSS which does not display the link
  • CSS which positions the link off the page
  • HTML links with no content between the anchor tags
  • HTML links with a 1 pixel gif between the anchor tags

We’ve found that including a combination of these in a function that is called on every page has attracted a lot of bots to our honeytrap, resulting in more rogue IP addresses logged on the Project Honeypot register. With a large online community of sites hosting honeypot pages provided by Project Honeypot, the web is seeded with countless trap pages recording the visits of potentially nefarious bots and spiders.
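As a rough illustration of such a function, here is a minimal sketch in Python (used purely for illustration; it is not the code that runs on our sites). The function name `honeytrap_links`, the link text, the trap URL and the path to the 1 pixel gif are all hypothetical.

```python
from html import escape


def honeytrap_links(trap_url: str) -> str:
    """Return HTML for several hidden links to the honeytrap page.

    trap_url is a hypothetical path; the markup is illustrative only.
    """
    href = escape(trap_url, quote=True)
    links = [
        # CSS that does not display the link at all
        f'<a href="{href}" style="display:none">archive</a>',
        # CSS that positions the link off the visible page
        f'<a href="{href}" style="position:absolute;left:-9999px">archive</a>',
        # Anchor with no content between the tags
        f'<a href="{href}"></a>',
        # Anchor wrapping a 1 pixel gif (the image path is an assumption)
        f'<a href="{href}"><img src="/images/pixel.gif" width="1" height="1" alt=""></a>',
    ]
    return "\n".join(links)


if __name__ == "__main__":
    print(honeytrap_links("/trap/page.html"))
```

Calling the function from a shared page footer would place all four variants on every page, so a bot that ignores one style of hidden link may still follow another.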

Even if a web development company does not want to contribute to capturing IP addresses, it can still use the register as a free resource against which to compare IP addresses. Simply requesting http://www.projecthoneypot.org/ip_#IPAddress# (where #IPAddress# is the address in question) will return information on that IP address, including the country of origin, user agent strings associated with its use, number of sightings, associated IP addresses, and a history of malicious use. Manually calling this page is all very well for the odd IP address that you need to check, but Project Honeypot also provides a service that allows you to do this programmatically.

This is how we use the register. When a request comes to one of our sites, we identify the IP address. This can normally be found in the Remote_Addr value of the CGI scope. In our case, requests are handled by a load balancer and arrive with that device’s IP address, so we read the visitor’s address from the X-Cluster-Client-IP header instead.
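The lookup order described above might be sketched like this in Python. The function name `client_ip` and the `behind_load_balancer` flag are hypothetical, and the sketch assumes CGI/WSGI-style request variables, where the X-Cluster-Client-IP header appears as `HTTP_X_CLUSTER_CLIENT_IP`.

```python
from ipaddress import ip_address
from typing import Mapping, Optional


def client_ip(environ: Mapping[str, str], behind_load_balancer: bool = False) -> Optional[str]:
    """Pick the visitor's IP address from CGI/WSGI-style request variables."""
    candidates = []
    if behind_load_balancer:
        # Only trust this header when the load balancer is known to set it;
        # a client can forge it on a direct connection.
        candidates.append(environ.get("HTTP_X_CLUSTER_CLIENT_IP", ""))
    candidates.append(environ.get("REMOTE_ADDR", ""))

    for raw in candidates:
        try:
            # Validate and normalise so junk never reaches the database
            return str(ip_address(raw.strip()))
        except ValueError:
            continue
    return None
```

Falling back to `REMOTE_ADDR` when the header is missing or malformed is a design choice of this sketch, not something the setup above requires.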

We have a database table of all IP addresses that have visited our sites, and their status, which can be unchecked, approved, blacklisted or whitelisted. If an IP address has never been seen before on our sites, we simply add a new row to the table for that IP address with an unchecked status. The idea is that all unchecked IP addresses are sent to Project Honeypot to see if they are malicious, but we don’t do this in real time, because of the performance hit. Rather, we run a scheduled task every five minutes that loops through all unchecked IP addresses and sends them over for review.
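To make the table and the "add as unchecked" step concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `visitor_ip`, the column names, the batch limit and the choice of SQLite are all assumptions for illustration; only the four status values come from the description above.

```python
import sqlite3
from typing import List


def create_table(conn: sqlite3.Connection) -> None:
    """Create the IP table with the four statuses described above."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS visitor_ip (
            ip     TEXT PRIMARY KEY,
            status TEXT NOT NULL DEFAULT 'unchecked'
                   CHECK (status IN ('unchecked', 'approved', 'blacklisted', 'whitelisted'))
        )
        """
    )


def record_visit(conn: sqlite3.Connection, ip: str) -> str:
    """Add the IP as 'unchecked' if it is new, then return its current status."""
    # INSERT OR IGNORE leaves an existing row (and its status) untouched
    conn.execute("INSERT OR IGNORE INTO visitor_ip (ip) VALUES (?)", (ip,))
    row = conn.execute("SELECT status FROM visitor_ip WHERE ip = ?", (ip,)).fetchone()
    return row[0]


def unchecked_ips(conn: sqlite3.Connection, limit: int = 500) -> List[str]:
    """Return a batch of addresses for the scheduled task to review."""
    rows = conn.execute(
        "SELECT ip FROM visitor_ip WHERE status = 'unchecked' ORDER BY ip LIMIT ?",
        (limit,),
    )
    return [r[0] for r in rows]
```

The request path only ever touches the local table through `record_visit`, while a job run every five minutes would call `unchecked_ips` and do the slower remote checks out of band.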

Each IP address that is sent to Project Honeypot is either on their list of suspicious IP addresses or not. Those that are viewed as safe, we mark as approved. Those that are viewed as unsafe, we mark as blacklisted. As of the time of writing, we have sent 350,000 IP addresses to Project Honeypot for review. Of these, 6,500 (just under 2%) have been identified as unsafe and marked accordingly.
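The article does not name the programmatic service involved. One such service is Project Honey Pot's DNS-based http:BL, and the sketch below assumes it: a DNS query for the access key, the reversed IPv4 octets and the `dnsbl.httpbl.org` zone, answered with `127.<days>.<threat>.<type>` when the address is listed. The rule that maps an answer to approved or blacklisted is an assumption of this sketch, not necessarily the rule we apply, and all function names are hypothetical.

```python
import socket
from typing import Optional

# Visitor-type bits in the last octet of an http:BL answer (0 means search engine)
SUSPICIOUS, HARVESTER, COMMENT_SPAMMER = 1, 2, 4


def build_query(access_key: str, ip: str) -> str:
    """Build the DNS name to look up: key, reversed IPv4 octets, then the zone."""
    reversed_octets = ".".join(reversed(ip.split(".")))
    return f"{access_key}.{reversed_octets}.dnsbl.httpbl.org"


def status_for(answer: Optional[str]) -> str:
    """Map an http:BL answer (or None when the IP is not listed) to a status."""
    if answer is None:
        return "approved"
    first, _days, _threat, visitor_type = (int(part) for part in answer.split("."))
    if first != 127:
        # Anything other than 127.x.x.x signals an error, so leave it for a retry
        return "unchecked"
    if visitor_type & (SUSPICIOUS | HARVESTER | COMMENT_SPAMMER):
        return "blacklisted"
    return "approved"  # type 0: a known search engine


def lookup(access_key: str, ip: str) -> Optional[str]:
    """Perform the DNS query and return the answer, or None if it does not resolve.

    This sketch does not distinguish "not listed" from a transient DNS failure.
    """
    try:
        return socket.gethostbyname(build_query(access_key, ip))
    except socket.gaierror:
        return None
```

A scheduled task could then call `status_for(lookup(key, ip))` for each unchecked address and write the result back to the table.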

Next time round, I’ll be posting the second half of this article, explaining what it means for an IP address to be blacklisted on our sites, how we handle false positives, and other approaches we take to identifying bots.

Author: Garry Viner

Garry Viner is the Director of Development and one of the founders of The Web Showroom. He has worked in IT since 1995 and has focussed on web since 1999.

At The Web Showroom he is still actively engaged in his first love, which is writing the code and creating the database objects that power your website. When he feels like sharing, he also allows his other developers to contribute.

Garry is committed to keeping Mission Control at the forefront of hosted CMS’s in Australia, ensuring your business has the platform to reach all its online goals.

 