When I started this blog, I intended to use it to document my exploits on the web. I have been building websites for quite a long time now, and while I can’t say that I know everything, I’m pretty darn good at my job.
Last July I took a position as a Project Manager at a very big web development company. I was given a handful of sites to manage, sites that were in pretty bad decay, and tasked with fixing them up. Along the way, I’ve learned a hell of a lot of stuff… but most importantly, that there are an amazing number of “webmasters” out there who don’t understand the fundamentals of search engine optimization. So today, we’re going to address the highly ignored and forgotten robots.txt.
The robots.txt file sits on your server and tells spiders where they’re allowed or not allowed to go on your site. It doesn’t exactly stop them from going there, but it tells them where you’d like them to not go. This is important to remember… anything you put in there doesn’t necessarily do anything, since a spider can decide to completely ignore the robots.txt file.
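To make the “it asks, it doesn’t enforce” point concrete, here’s a minimal sketch in Python using the standard library’s urllib.robotparser. The rules, the ExampleBot name, and the example.com URLs are all made up for illustration. A polite spider runs a check like this before it fetches a page; a rude one simply never bothers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A well-behaved spider asks first and skips the page when the answer is no.
for url in ("https://example.com/index.html",
            "https://example.com/drafts/post.html"):
    verdict = "fetch" if is_allowed(ROBOTS_TXT, "ExampleBot", url) else "skip"
    print(verdict, url)
```

Nothing in this check touches your server. The whole decision happens inside the spider, which is exactly why a spider that skips the check can go wherever it likes.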
Now that we know what the robots.txt file does, why would anyone want to block spiders from accessing parts of their site? Well… Maybe we should start with the wrong reasons people block spiders, and why those reasons are wrong.
1.) I have private sections of my site that I don’t want everyone to see. If you have something that you want to keep private, the last place that you want to put it is on the internet. Even if you password protect the area, someone out there can get in to see it. In fact, the juicier the area is, the more inclined people will be to try to get in. Still, all the time you see people not password protecting sections of their site and listing them in the robots.txt file. When you do this, you’re basically telling everyone exactly where you keep your private information. This is a hard lesson learned a long time ago by The White House. If you want to keep your private data private, keep it off the web, or at least password protect it and don’t broadcast to everyone where it’s hiding.
2.) I don’t want spam bots stealing my email address. Remember how I said that spiders can choose to ignore the robots.txt file? As it turns out, the people who build those shady email address stealing bots… well, they just happen to be shady people! And what do shady people allow their shady spiders to do? You guessed it… They allow their shady spiders to ignore your robots.txt file. Basically, when you block these spiders, you do nothing to stop them and just clutter up your robots.txt file. In fact, if you want to find a place for your email-harvesting bot to start gathering, a simple Google search will tell you everyone who’s got something to hide.
3.) Spiders use too much of my bandwidth. Nowadays, bandwidth is cheap, and if your site is seriously being bogged down by search engine bots, you should really look into upgrading your hosting. If your servers can’t handle the spiders, they definitely can’t handle a significant volume of productive traffic… which you’ll be guaranteeing never to see en masse if you block the spiders. Upgrade your hosting package and take a class on monetizing traffic.
4.) Google image traffic is garbage and I don’t want it. This one is hard to argue against, because Google image traffic pretty much is garbage. Still, I shy away from anything having to do with banning Google from your site, regardless of the capacity. There is, however, something to be said for garbage traffic. In volume, garbage traffic costs you money in hosting and bandwidth, but it’s rarely worthless. If your site is properly set up to monetize, the quality of traffic becomes less important (for example, if you can sell your ads on a CPM basis, free garbage traffic can make big money). By catering to Google images you also open yourself up to great site branding opportunities, since domain names are speckled all over the Google image results pages. Finally, if Google sends you 10,000 garbage visits a month, and from that you get one bookmark, incoming link, or registered user, aren’t you at least a little better off?
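Going back to reason 1 for a second, here’s a rough sketch of why listing private areas in robots.txt backfires. The file is public, and pulling the Disallow paths out of it takes a few lines of Python. The sample file and its paths are hypothetical, and the parsing is deliberately simplistic.

```python
# Hypothetical robots.txt that "hides" private areas, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /private/      # family photos
Disallow: /admin/backup/
Allow: /
"""


def disallowed_paths(robots_txt):
    """Collect every path named in a Disallow line, in file order."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


print(disallowed_paths(SAMPLE))
```

Anyone who is curious gets a tidy list of exactly the places you were hoping nobody would look.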
Now that we know why not to block spiders with robots.txt, why would we ever want to? It really comes down to optimizing the time spiders spend on your site, and making sure that the pages they index are the pages you want them to index. Let’s use a couple of examples:
Let’s say that you have a page on your site that displays every movie you’ve ever seen at the theater. It’s got three columns: one for the title of the movie, one for the date you saw it, and one for the rating you gave it. There is a benefit to your users in having three versions of that page, each of which sorts the data differently: one alphabetically by title, one by date, and one by rating. If these three pages have different URLs, Google will index all three of them separately. Now, you need to ask yourself some questions. Will Google recognize each page as having unique value that’s different from the other pages? Is there any additional information on the second and third pages that will give them a different ranking for different keywords than the first page? If Google is only going to spend a limited amount of time on your site, and index a limited number of pages while it’s there, is Google’s time better spent somewhere else on the site? Based on your answers to these questions, you might decide to allow Google to index the first page, but not the other two. This is where robots.txt is your friend.
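Here’s roughly what that could look like. The three file names are invented for this example (your sorted pages will be called something else), and the Python around the rules is just a quick way to check them with the standard library’s urllib.robotparser before you upload anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: leave the by-title page crawlable and ask spiders
# to skip the two re-sorted copies. All three file names are assumptions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /movies-by-date.html
Disallow: /movies-by-rating.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def crawlable(path, user_agent="Googlebot"):
    """Return True if the rules above let user_agent fetch path."""
    return parser.can_fetch(user_agent, path)


for page in ("/movies-by-title.html",
             "/movies-by-date.html",
             "/movies-by-rating.html"):
    print(page, "crawl" if crawlable(page) else "skip")
```

The part between the triple quotes is what would go into the robots.txt file itself; the rest is only there to test it.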
Let’s look at another example. Let’s say that next to each rating on your movie list you’ve got a link to a page where people can email you their thoughts on the rating you’ve given, and let’s say that the contents of that page are just a basic “contact us” form. Instead of making a separate page for each movie that you’ve seen, you have this page built dynamically, so now you have the links going to emailme.php?movie=movie1 for the first movie, emailme.php?movie=movie2 for the second, and so on. It’s a very effective way of building the site, and it’s commonly the way pages like this are created. Now, here’s the problem: Google is going to index all of those pages as unique URLs on your site. In reality, it’s the same page, with a variable causing it to be loaded slightly differently. This can be a big issue if you have a whole lot of movies that you’ve seen, because Google is indexing a ton of pages that have no actual value. You’re never going to see any search engine traffic to those pages, and in many cases they have duplicate titles and meta data. Pages like this can be easily identified in Google Webmaster Tools and are an excellent example of pages that should be blocked by the robots.txt file.
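Because Disallow rules match on the start of the URL path, one line can cover every variation of emailme.php, whatever the movie parameter says. Here’s a small sketch that checks this with Python’s urllib.robotparser. It assumes emailme.php lives at the root of the site, and /movies.html is a made-up page used only for comparison.

```python
from urllib.robotparser import RobotFileParser

# One prefix rule is meant to cover emailme.php?movie=movie1, movie2, ...
# Assumes the script sits at the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /emailme.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def crawlable(path, user_agent="Googlebot"):
    """Return True if the rules above let user_agent fetch path."""
    return parser.can_fetch(user_agent, path)


for path in ("/emailme.php?movie=movie1",
             "/emailme.php?movie=movie2",
             "/movies.html"):  # hypothetical page that should stay crawlable
    print(path, "crawl" if crawlable(path) else "skip")
```

The same rule would also catch anything else whose path starts with /emailme.php, so keep the prefix as specific as you can.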
So, there you have it… The Idiot’s Intro to Robots.txt. It’s probably worth saying why I decided to write this list. This week at work I was asked to look at a site we had purchased that wasn’t performing as well in the search engines as we would expect. As it turns out, the former owners had used robots.txt to block Yahoo completely and drastically limit the abilities of Googlebot. It was such a major find that I was tasked with performing an audit on the robots.txt files of all our sites. What I found was astonishing, and I realized that even good webmasters fail to understand the importance of such a simple tool.
