Block Dynamic URLs From Googlebot Using Your Robots.txt File
I have been trying to find out how to block certain dynamic URLs from Googlebot. The Yahoo! Slurp and MSNBot search bots use the same or very similar syntax, so the technique below should carry over. As an example, I have one line in my .htaccess file that lets me serve static-looking pages instead of dynamic ones, but I have found that Googlebot will sometimes still crawl the dynamic versions. That leads to duplicate content, which none of the major search engines look kindly on.
I am trying to clean up my personals site, which currently ranks well with Yahoo but not Google. I believe MSN Live uses algorithms similar to Google’s, though that is only a hunch drawn from my own SEO experience and my clients’ sites, not anything scientifically proven. I think I have found some answers on ranking well with Google, MSN and possibly Yahoo, and I am in the midst of testing right now; I have already managed to rank a client’s site well on Google for relevant keywords. Anyway, here is how to block the dynamic pages from Google using your robots.txt file. First, the relevant line from my .htaccess file:
RewriteRule ^personals-dating-([0-9]+)\.html$ /index.php?page=view_profile&id=$1 [L]
This rule, in case you’re wondering, lets me serve a static page such as personals-dating-4525.html in place of the dynamic link index.php?page=view_profile&id=4525. The catch is that both URLs now serve the same page, so Googlebot can, and in my case did, flag the site for duplicate content. Duplicate content is frowned upon: it makes Googlebot crawl extra pages for nothing, and the algorithm can read it as spammy. The moral is that duplicate content should be avoided at all costs.
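For the curious, here is the same mapping expressed as a minimal Python sketch; the function name and the script are my own, purely for illustration, and only mirror what the rewrite rule does:

import re

# Mirror of the .htaccess rule: personals-dating-(digits).html -> dynamic URL.
STATIC_PATTERN = re.compile(r"^personals-dating-([0-9]+)\.html$")

def static_to_dynamic(path):
    """Translate a static page name to the dynamic URL it rewrites to."""
    match = STATIC_PATTERN.match(path)
    if match is None:
        return None  # not one of the rewritten profile pages
    return "/index.php?page=view_profile&id=" + match.group(1)

print(static_to_dynamic("personals-dating-4525.html"))
# -> /index.php?page=view_profile&id=4525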
What follows is an extract of my robots.txt file:
User-agent: Googlebot
Disallow: /index.php?page=view_profile&id=*
Notice the “*” (asterisk) at the end of the second line. In Googlebot’s extended robots.txt syntax, the asterisk is a wildcard that matches any sequence of characters, so this rule covers index.php?page=view_profile&id=4525 and every other ID. (Since Disallow rules already match by prefix, the trailing asterisk is technically redundant, but it makes the intent explicit.) In other words, none of these dynamic pages will be indexed. You can check whether the rules in your robots.txt file work as intended by logging into your Google webmaster control panel. If you don’t have a Google account, you can create one through Gmail, AdWords or AdSense; it’s free, simple to set up, and gives you access to the Google webmaster tools and control panel, which you should have anyway if you’re after higher rankings. Once logged in, click the “Diagnostics” tab and then the “robots.txt analysis tool” link under the Tools section in the left column.
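If you want to see exactly what the wildcard is doing, here is a short Python sketch that translates a Disallow pattern into a regular expression the way Googlebot’s documented extension behaves (asterisk matches any run of characters, a trailing “$” anchors the match, everything else is literal); the helper name is my own invention:

import re

def robots_pattern_to_regex(pattern):
    """Convert a Googlebot-style Disallow pattern to a compiled regex.
    '*' matches any sequence of characters; a trailing '$' anchors the
    end of the match; everything else (including '?' and '&') is literal."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as a wildcard.
    regex = re.escape(pattern).replace(r"\*", ".*")
    regex = "^" + regex + ("$" if anchored else "")
    return re.compile(regex)

rule = robots_pattern_to_regex("/index.php?page=view_profile&id=*")
print(bool(rule.match("/index.php?page=view_profile&id=4525")))  # True
print(bool(rule.match("/index.php?page=search")))                # False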
By the way, your robots.txt file should sit in your web root folder. Googlebot fetches your site’s robots.txt about once a day, and the copy shown in your Google webmaster control panel under the “robots.txt analysis tool” section is refreshed accordingly.
To test your robots.txt file and confirm that your rules will work with Googlebot, simply type the URL you would like to test into the field “Test URLs against this robots.txt file”. I added the following line to this field:
http://www.personals1001.com/index.php?page=view_profile&id=4235
Then I clicked the “Check” button at the bottom of the page, and the tool confirmed that Googlebot will block this URL under the rule above. I believe this is a better approach than the “URL Removal” tool, which you will find in the left column of your Google webmaster control panel; I have read several reports in the Google Groups of people running into problems with it.
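If you would rather sanity-check a URL offline before reaching for the webmaster tools, a short script along the following lines can fetch your live robots.txt and apply the same prefix-plus-wildcard matching. This is my own rough approximation of how Googlebot matches, not Google’s actual code, and it makes simplifying assumptions (one User-agent line per group, no “$” anchors):

import re
import urllib.request

def fetch_googlebot_disallows(robots_url):
    """Return the Disallow patterns from the Googlebot group of a robots.txt.
    Simplified: assumes each group lists a single User-agent line."""
    text = urllib.request.urlopen(robots_url).read().decode("utf-8", "replace")
    disallows, in_googlebot = [], False
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            in_googlebot = value.lower() == "googlebot"
        elif field == "disallow" and in_googlebot and value:
            disallows.append(value)
    return disallows

def is_blocked(path, patterns):
    """True if any Disallow pattern matches the path (prefix match, '*' wildcard)."""
    for pattern in patterns:
        regex = "^" + re.escape(pattern).replace(r"\*", ".*")
        if re.match(regex, path):
            return True
    return False

rules = fetch_googlebot_disallows("http://www.personals1001.com/robots.txt")
print(is_blocked("/index.php?page=view_profile&id=4235", rules))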