Robots.txt revisited. Also, which robots to disallow?
We’ve been switching servers and web hosting for a lot of web sites in the last 3 weeks. We allotted a healthy dose of bandwidth to these sites’ new shared hosting accounts, knowing that upload bandwidth can count toward traffic, even though one of our dedicated server providers doesn’t count incoming FTP traffic. So we were surprised to find that Rakrakan, one of the sites we recently moved, had already eaten up the 5 GB of bandwidth assigned to it. This is a 54 MB site powered by WordPress, where most if not all of the content is text stored in the database. A quick look at the Webalizer traffic logs (provided on cPanel) revealed visits from the known and trusted web robots Googlebot and Yahoo’s Slurp.
However, we also noticed a new entry by the name of TurnitinBot. A quick search turned up the FAQ for TurnitinBot. According to them,
“This robot collects content from the Internet for the sole purpose of helping educational institutions prevent plagiarism. In particular, we compare student papers against the content we find on the Internet to see if we can find similarities. For more information on this service, please visit www.turnitin.com”
Well, since Rakrakan is most likely not a subject taught at educational institutions, we can reasonably block TurnitinBot from visiting the web site again.
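For the curious, here is a minimal sketch of what blocking TurnitinBot could look like, with the rules checked using Python's standard urllib.robotparser module. The robots.txt body and the example.com URLs are illustrative assumptions, not the actual Rakrakan configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body: shut out TurnitinBot, leave everyone else alone.
# An empty "Disallow:" line means "nothing is disallowed".
BLOCK_TURNITIN = """\
User-agent: TurnitinBot
Disallow: /

User-agent: *
Disallow:
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access needed)."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


turnitin_rules = load_rules(BLOCK_TURNITIN)

# example.com stands in for the real site.
for agent in ("TurnitinBot", "Googlebot", "Slurp"):
    allowed = turnitin_rules.can_fetch(agent, "http://example.com/some-post/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

The text inside BLOCK_TURNITIN is what would go into a robots.txt file at the root of the site. Keep in mind that this only helps if the robot actually reads and honors the file.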
So here’s a refresher on robots.txt, that simple way of controlling how Internet robots interact with your web site. Controlling robots is rather important, especially if your web site has important assets you don’t want to appear in third-party databases (whether or not that data ends up on the web). It is also important if the web hosting account has a limited bandwidth allocation. A robot can be told what to visit, where not to go, whether to go slow, and so on, although compliance is voluntary and only well-behaved robots follow the rules. Webmasters should make it a priority to have a properly formatted robots.txt; visit the RobotsTXT.org web site for the lowdown on how to implement robots.txt for your site.
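As a rough illustration of the "where to go" and "whether to go slow" parts, the sketch below keeps robots out of one directory and asks Slurp to pause between requests, then reads the rules back with Python's urllib.robotparser. The /wp-admin/ path, the 10-second delay, and the example.com URLs are assumptions picked for the example. Crawl-delay is a non-standard extension that some robots ignore (Googlebot is one of them), so treat it as a request rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep robots out of the admin area and ask Slurp to slow down.
# A robot follows only the group that matches it best, so the Slurp group
# has to repeat the Disallow line from the catch-all group.
SLOW_AND_FENCED = """\
User-agent: Slurp
Crawl-delay: 10
Disallow: /wp-admin/

User-agent: *
Disallow: /wp-admin/
"""

fenced_rules = RobotFileParser()
fenced_rules.parse(SLOW_AND_FENCED.splitlines())

# Seconds a robot is asked to wait between requests (None if nothing was requested).
print("Slurp delay:", fenced_rules.crawl_delay("Slurp"))
print("Googlebot delay:", fenced_rules.crawl_delay("Googlebot"))

# example.com stands in for the real site.
for url in ("http://example.com/wp-admin/options.php", "http://example.com/some-post/"):
    verdict = "allowed" if fenced_rules.can_fetch("Googlebot", url) else "blocked"
    print(url, "->", verdict)
```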
In the meantime, I have been searching the Internet for information on which robots to disallow completely from web sites. There is a robots database on RobotsTXT.org, but there is no quick guide on which ones to disallow. If you have any experience or opinion as to which robots can be disallowed without harming a web site’s SEO and SERP rankings, please post it here.
Photo credit: flysi on Flickr.