Preventing web crawlers from indexing everything
Ok, so we’ve seen how to password-protect directories to keep the web crawlers out, but I don’t want to go that route. I want to keep the page open; I just don’t want it spidered and indexed by the bots.
There are ways to do this too; in fact, there are several. The most widely accepted and respected way of telling a bot not to crawl certain areas of a website is a file called robots.txt. It lives in the root of your site (the same folder as your main index page) and looks like this.
User-agent: *
Disallow: /
The above will keep all robots out of your site. That might be too heavy-handed, though. Let’s say msnbot has been a bit too voracious in your downloads area.
User-agent: msnbot
Disallow: /downloads/
That should be enough to keep it out of that folder. Here’s another example.
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
You can get more complicated than this if you need to.
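For instance, you can combine a rule for one bot with a catch-all group, and most major crawlers (Googlebot, Bingbot) also understand an Allow directive, even though it wasn’t part of the original standard. The paths here are just for illustration, and note that a bot matching a specific group ignores the catch-all group.
User-agent: msnbot
Disallow: /downloads/

User-agent: *
Disallow: /cgi-bin/
Allow: /cgi-bin/public/
Disallow: /images/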
For a real-world example of a more involved file, take a look at Google’s own robots.txt: https://www.google.com/robots.txt
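If you want to sanity-check how a standards-following crawler will read your rules before you rely on them, here’s a quick sketch using Python’s built-in urllib.robotparser; the example.com URLs are just placeholders.
from urllib.robotparser import RobotFileParser

# The rules to test, exactly as they would appear in robots.txt
rules = """\
User-agent: msnbot
Disallow: /downloads/

User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# msnbot is blocked from /downloads/, but other bots are not
print(parser.can_fetch("msnbot", "https://example.com/downloads/file.zip"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/downloads/file.zip"))  # True
print(parser.can_fetch("Googlebot", "https://example.com/cgi-bin/search"))      # False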
To exclude a specific page from being indexed, you might try the following meta tag in the <head> of your document (noindex tells bots not to index the page, and nofollow tells them not to follow its links).
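<meta name="robots" content="noindex,nofollow">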
You can also use index and follow in the content attribute to fine-tune what you want to restrict or allow.
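For example, this combination lets bots index the page itself but tells them not to follow any of its links.
<meta name="robots" content="index,nofollow">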
I’m not certain how widely the meta tag is respected; robots.txt is more likely to be followed.
To exclude just the googlebot, you might try this…
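<meta name="googlebot" content="noindex">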
According to Google’s page on removing pages from the index, Google will apparently respect that tag, and that would still allow other bots through.