Collection of Robots.txt Files

The implementation of a suitable robots.txt file is very important for search engine optimization. There is plenty of advice around the Internet on creating such files (if you are looking for an introduction to this topic, read “Create a robots.txt file”), but what if, instead of looking at what people say, we looked at what people do?

That is what I did, collecting the robots.txt files from a wide range of blogs and websites. Below you will find them.

Key Takeaways

  • Only 2 out of the 30 websites that I checked were not using a robots.txt file
  • So even if you don’t have any specific requirements for the search bots, you should probably use a simple robots.txt file
  • Most people stick to the “User-agent: *” line to cover all agents
  • The most commonly disallowed item is the RSS feed
  • Google itself uses a combination of closed folders (with a trailing slash, e.g., /searchhistory/) and open ones (without it, e.g., /search). Rules are matched as URL prefixes, so the two forms are treated differently: /search also covers /searchhistory and /search?q=robots, while /searchhistory/ only covers URLs inside that folder
  • A minority of the sites included the sitemap URL in the robots.txt file
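
The trailing-slash point in the list above can be checked with a short script. This is only a minimal sketch: it uses Python's standard urllib.robotparser module, which applies rules in file order as plain prefix matches, and real crawlers may resolve conflicts differently. The three rules are copied from Google's file further down; the test paths and the "ExampleBot" agent name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Three rules copied from Google's robots.txt listed further down.
RULES = """\
User-agent: *
Allow: /searchhistory/
Disallow: /search
"""

# The minimalistic file used by several of the blogs below.
MINIMAL = """\
User-agent: *
Disallow:
"""

def is_allowed(rules: str, path: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)

if __name__ == "__main__":
    # Hypothetical paths chosen to show the prefix behaviour.
    for path in ("/search", "/searchhistory", "/searchhistory/today", "/about"):
        print(path, is_allowed(RULES, path))
    # An empty Disallow line blocks nothing.
    print("/any/page.html", is_allowed(MINIMAL, "/any/page.html"))
```

With this parser, /search and /searchhistory come out blocked (both start with /search), while /searchhistory/today is allowed by the more specific rule with the trailing slash.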

The Minimalistic Guys


Problogger.net

User-agent: *
Disallow:


Marketing Pilgrim

User-agent: *
Disallow:

Search Engine Journal

User-agent: *
Disallow:

Matt Cutts

User-agent: *
Allow:
User-agent: *
Disallow: /files/

Pronet Advertising

User-agent: *
Disallow: /mt
Disallow: /*.cgi$

TechCrunch

User-agent: *
Disallow: /*/feed/
Disallow: /*/trackback/
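
The * and $ characters in the Pronet Advertising and TechCrunch rules are extensions to the original robots.txt convention that Google and other major search engines understand. To my knowledge, Python's standard urllib.robotparser treats them as literal text, so the sketch below uses a small hand-written matcher instead. The helper names are hypothetical, and the assumed semantics are: * matches any run of characters, a trailing $ anchors the end of the URL, and everything else is a literal prefix.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regular expression.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end of the URL, and everything else is a literal prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def pattern_matches(pattern: str, path: str) -> bool:
    """Return True if the URL path is covered by the pattern."""
    return pattern_to_regex(pattern).match(path) is not None

if __name__ == "__main__":
    # Hypothetical URLs checked against rules from the two files above.
    print(pattern_matches("/*/feed/", "/2007/06/some-post/feed/"))
    print(pattern_matches("/*/feed/", "/feed/"))
    print(pattern_matches("/*.cgi$", "/cgi-bin/form.cgi"))
    print(pattern_matches("/*.cgi$", "/cgi-bin/form.cgi?x=1"))
```

Under these assumptions, /*/feed/ covers the feed of an individual post but not the top-level /feed/, and /*.cgi$ only covers URLs that end in .cgi.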

The Structured Ones

Online Marketing Blog

User-agent: Googlebot
Disallow: */feed/

User-agent: *
Disallow: /Blogger/
Disallow: /wp-admin/
Disallow: /stats/
Disallow: /cgi-bin/
Disallow: /2005x/

Shoemoney

User-Agent: Googlebot
Disallow: /link.php
Disallow: /gallery2
Disallow: /gallery2/
Disallow: /category/
Disallow: /page/
Disallow: /pages/
Disallow: /feed/
Disallow: /feed

Scoreboard Media

User-agent: *
Disallow: /cgi-bin/

User-agent: Googlebot
Disallow: /category/
Disallow: /page/
Disallow: */feed/
Disallow: /2007/
Disallow: /2006/
Disallow: /wp-*

SEOMoz.org

User-agent: *
Disallow: /blogdetail.php?ID=537
Disallow: /blog?page
Disallow: /blog/author/
Disallow: /blog/category/
Disallow: /tracker
Disallow: /ugc?page
Disallow: /ugc/author/
Disallow: /ugc/category/

Wolf-Howl

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /noindex/
Disallow: /privacy-policy/
Disallow: /about/
Disallow: /company-biographies/
Disallow: /press-media-room/
Disallow: /newsletter/
Disallow: /contact-us/
Disallow: /terms-of-service/
Disallow: /terms-of-service/
Disallow: /information/comment-policy/
Disallow: /faq/
Disallow: /contact-form/
Disallow: /advertising/
Disallow: /information/licensing-information/
Disallow: /2005/
Disallow: /2006/
Disallow: /2007/
Disallow: /2008/
Disallow: /2009/
Disallow: /2004/
Disallow: /*?*
Disallow: /page/
Disallow: /iframes/

John Chow

sitemap: http://www.johnchow.com/sitemap.xml

User-agent: *
Disallow: /cgi-bin/
Disallow: /go/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /author/
Disallow: /page/
Disallow: /category/
Disallow: /wp-images/
Disallow: /images/
Disallow: /backup/
Disallow: /banners/
Disallow: /archives/
Disallow: /trackback/
Disallow: /feed/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: Mediapartners-Google
Allow: /

User-agent: duggmirror
Disallow: /
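
John Chow's file has several User-agent groups, and a crawler normally follows only the group that names it, falling back to the * group otherwise. The sketch below illustrates this with Python's urllib.robotparser (site_maps() needs Python 3.8 or later). The file is abbreviated, and the "SomeBot" agent and the test paths are made up; other parsers may behave differently.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the John Chow file above (most Disallow lines omitted).
JOHN_CHOW = """\
sitemap: http://www.johnchow.com/sitemap.xml

User-agent: *
Disallow: /wp-admin/
Disallow: /images/
Disallow: /feed/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: duggmirror
Disallow: /
"""

def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(JOHN_CHOW)

if __name__ == "__main__":
    # An unnamed bot falls back to the '*' group.
    print(parser.can_fetch("SomeBot", "/feed/"))
    print(parser.can_fetch("SomeBot", "/2007/06/a-post/"))
    # duggmirror has its own group and is shut out completely.
    print(parser.can_fetch("duggmirror", "/2007/06/a-post/"))
    # Googlebot-Image is judged by its own group only.
    print(parser.can_fetch("Googlebot-Image", "/images/logo.png"))
    print(parser.site_maps())
```

One consequence worth noticing: because Googlebot-Image has its own group, this parser does not apply the Disallow: /images/ line from the * group to it.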

Smashing Magazine

Sitemap: http://www.smashingmagazine.com/sitemap.xml

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /styles/
Disallow: /inc/
Disallow: /tag/
Disallow: /cc/
Disallow: /category/

User-agent: MSIECrawler
Disallow: /

User-agent: psbot
Disallow: /

User-agent: Fasterfox
Disallow: /

User-agent: Slurp
Crawl-delay: 200
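
The Crawl-delay line for Slurp (Yahoo's crawler) is a non-standard extension that asks a bot to pause between requests; the value is usually read as seconds, and not every crawler honours it. As a rough sketch, Python's urllib.robotparser exposes the value through crawl_delay(). The file below is abbreviated, and "OtherBot" is a made-up agent name.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the Smashing Magazine file above.
SMASHING = """\
Sitemap: http://www.smashingmagazine.com/sitemap.xml

User-agent: *
Disallow: /styles/
Disallow: /tag/

User-agent: psbot
Disallow: /

User-agent: Slurp
Crawl-delay: 200
"""

def crawl_delay_for(text: str, agent: str) -> Optional[int]:
    """Return the Crawl-delay that applies to `agent`, or None if there is none."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser.crawl_delay(agent)

if __name__ == "__main__":
    for agent in ("Slurp", "psbot", "OtherBot"):
        print(agent, crawl_delay_for(SMASHING, agent))
```

Only Slurp gets a delay here; the other agents have no Crawl-delay line in their group, so the function returns None for them.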

Gizmodo

User-Agent: Googlebot
Disallow: /index.xml$
Disallow: /excerpts.xml$
Allow: /sitemap.xml$
Disallow: /*view=rss$
Disallow: /*?view=rss$
Disallow: /*format=rss$
Disallow: /*?format=rss$
Sitemap: http://gizmodo.com/sitemap.xml

Lifehacker

User-Agent: Googlebot
Disallow: /index.xml$
Disallow: /excerpts.xml$
Allow: /sitemap.xml$
Disallow: /*view=rss$
Disallow: /*?view=rss$
Disallow: /*format=rss$
Disallow: /*?format=rss$
Sitemap: http://lifehacker.com/sitemap.xml

The Mainstream Media

Wall Street Journal

User-agent: *
Disallow: /article_email/
Disallow: /article_print/
Disallow: /PA2VJBNA4R/
Sitemap: http://online.wsj.com/sitemap.xml

ZDNet

User-agent: *
Disallow: /Ads/
Disallow: /redir/
# Disallow: /i/ is removed per 190723
Disallow: /av/
Disallow: /css/
Disallow: /error/
Disallow: /clear/
Disallow: /mac-ad
Disallow: /adlog/
# URS per bug 239819, these were expanded
Disallow: /1300-
Disallow: /1301-
Disallow: /1302-
Disallow: /1303-
Disallow: /1304-
Disallow: /1305-
Disallow: /1306-
Disallow: /1307-
Disallow: /1308-
Disallow: /1309-
Disallow: /1310-
Disallow: /1311-
Disallow: /1312-
Disallow: /1313-
Disallow: /1314-
Disallow: /1315-
Disallow: /1316-
Disallow: /1317-

NY Times

# robots.txt, www.nytimes.com 6/29/2006
#
User-agent: *
Disallow: /pages/college/
Disallow: /college/
Disallow: /library/
Disallow: /learning/
Disallow: /aponline/
Disallow: /reuters/
Disallow: /cnet/
Disallow: /partners/
Disallow: /archives/
Disallow: /indexes/
Disallow: /thestreet/
Disallow: /nytimes-partners/
Disallow: /financialtimes/
Allow: /pages/
Allow: /2003/
Allow: /2004/
Allow: /2005/
Allow: /top/
Allow: /ref/
Allow: /services/xml/

User-agent: Mediapartners-Google*
Disallow:

YouTube

# robots.txt file for YouTube

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /profile
Disallow: /results
Disallow: /browse
Disallow: /t/terms
Disallow: /t/privacy
Disallow: /login
Disallow: /watch_ajax
Disallow: /watch_queue_ajax

Bonus

Google

User-agent: *
Allow: /searchhistory/
Disallow: /news?output=xhtml&
Allow: /news?output=xhtml
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalogues
Disallow: /news
Disallow: /nwshp
Disallow: /?
Disallow: /addurl/image?
Disallow: /pagead/
Disallow: /relpage/
Disallow: /relcontent
Disallow: /sorry/
Disallow: /imgres
Disallow: /keyword/
Disallow: /u/
Disallow: /univ/
Disallow: /cobrand
Disallow: /custom
Disallow: /advanced_group_search
Disallow: /advanced_search
Disallow: /googlesite
Disallow: /preferences
Disallow: /setprefs
Disallow: /swr
Disallow: /url
Disallow: /default
Disallow: /m?
Disallow: /m/search?
Disallow: /wml?
Disallow: /wml/search?
Disallow: /xhtml?
Disallow: /xhtml/search?
Disallow: /xml?
Disallow: /imode?
Disallow: /imode/search?
Disallow: /jsky?
Disallow: /jsky/search?
Disallow: /pda?
Disallow: /pda/search?


49 Responses to “Collection of Robots.txt Files”

  1. Stephen on June 14th, 2007 5:50 am

    Wow, I’m surprised that so many SEO experts don’t include a line for sitemap autodiscovery. It’s not like it’s difficult to implement or anything…

  2. Daniel on June 14th, 2007 5:54 am

    Stephen, I don’t think the “autodiscovery” factor is related to how easy it is to implement.

    The question is: will it bring tangible improvements?

  3. Pablo on June 14th, 2007 5:56 am

    nice stuff, i already changed my robots.txt

  4. Nia on June 14th, 2007 6:01 am

    This looks valuable except I don’t know how to use it yet. I’ve put it in my RSS shares and when I figure it out I’ll implement the lesson and post about it. Thanks. ;)

  5. Daniel on June 14th, 2007 6:05 am

    Nia, sorry about that. I just updated the article with a link to an introductory post I wrote some time ago:

    http://www.dailyblogtips.com/c.....stxt-file/

  6. John Wesley on June 14th, 2007 6:34 am

    Very interesting post. I actually started using something similar to Chow’s after he published it on his blog last week. It seems to be adding a bit of Google traffic.

  7. Hugh | A Politically Incorrect Entrepreneur on June 14th, 2007 7:14 am

    While crawling around the interweb a few days ago, I found the robots.txt file for the whitehouse (whitehouse.gov/robots.txt)

    I just thought it interesting the things they disallowed.

  8. Adnan on June 14th, 2007 8:06 am

    Hey Daniel - thanks for that compilation - its very interesting to see how some SEO sites like SearchEngineJournal were minimal, but how SEOMoz has something different.
    Now I need to decide on which one to choose ;)

  9. Daniel on June 14th, 2007 8:10 am

    Adnan, I will need to tweak mine as well. So far I am getting pretty good results with a minimalistic one though, just excluding feeds, trackbacks and WP files.

  10. Patrix on June 14th, 2007 1:48 pm

    I have been tweaking my robots.txt file for quite some time now mostly to reduce duplicate content (get pages out of supplemental hell)but haven’t noticed any appreciable difference.

    I have been checking a few A-bloggers blogs for their robots.txt files so thanks for doing this.

    BTW why does Shoemoney disallow some directories with and without the forward slash? What is the difference?

  11. Jordan McCollum on June 14th, 2007 2:16 pm

    What timing! I was just contemplating roboting out my category and archive pages. Thanks for this!

  12. Pchere on June 14th, 2007 9:03 pm

    I am also tweaking my robots.txt to remove duplicate content in Wordpress. It was very insightful to see how top sites are dealing with the issue.

  13. CypherHackz on June 14th, 2007 11:17 pm

    i have this, list of robots.txt links. you can see here: Big Websites with Big Robots

  14. Matt Wardman on June 15th, 2007 5:42 am

    >Wow, I’m surprised that so many SEO experts don’t include a line for sitemap autodiscovery. It’s not like it’s difficult to implement or anything…

    If you have a Google sitemap plugin for Wordpress it pings Google every time you post anyway.

    And:

    The robots.txt for webmasterworld.com has a blog in it. Fun.

    http://www.mattwardman.com/blo.....r-techies/

  15. Zath on July 1st, 2007 12:56 am

    I recently set up a robots.txt file and have noticed that my supplemental links on Google have gone down from around 2000 pages to about 250.

    I’m thinking that’s pretty good, but like others have said, I’m not quite sure how much of a difference it makes to my site rankings.

    Will this give more search engine traffic going forward or increase the chances of a better Pagerank?

  16. vijay on July 1st, 2007 6:04 am

    Hmm. I have not thinked yest to update my robot.txt
    Its as simple as problogger.net no complications ;-)
    I actually avoided that part coz I am not that much aware of robot.txt file changes and its effects!
    Soon will give some time for that.
    Thanks for the advice anyway.

  17. vijay on July 1st, 2007 6:05 am

    Hmm. I haven’t thought yet to update my robot.txt
    Its as simple as problogger.net no complications ;-)
    I actually avoided that part coz I am not that much aware of robot.txt file changes and its effects!
    Soon will give some time for that.
    Thanks for the advice anyway.

  18. TechZilo on July 14th, 2007 3:04 am

    I’d like to echo Zath’s question, since my number of indexed pages has gone down too..will it affect SERPs?

  19. Daniel on July 14th, 2007 8:16 am

    The effect upon individual ranking of your pages should not be huge, so do not expect to go from the tenth page to the first page of Google just because of using a robots.txt file.

    That said, your search engine traffic will probably increase a lot if many of your pages were in the supplemental hell. First and foremost because now you will cover many more keywords and terms.

  20. Visitor367 on July 19th, 2007 8:18 am

    I have visited your site 623-times

  21. Visitor413 on July 19th, 2007 8:18 am

    Your site found in Google: http://google.com/search?q=gdk

  22. AskApache on August 9th, 2007 5:35 am

    The real benefit to learning about the robots.txt file and how it works is it teaches you to think like the web crawlers. Especially when you start targeting different user-agents/bots…

    webmasterworld is definately the coolest, and 2nd is of course askapache.com

    http://www.askapache.com/seo/u.....press.html

  23. 东莞网站建设 on November 20th, 2007 1:06 am

    Very good learning

  24. Sangesh on January 3rd, 2008 7:23 am

    I got to know more about the “robots.txt” in this article.

    Thanks.

  25. John on February 8th, 2008 12:59 am

    Hello…!

    Can anyone tell me the list of websites which archives the websites. Pandora, Internet archive’s Waybackmachine are the some of the examples, I want to know the entire web archiving websites, please…..

  26. SEO Freelancer on March 8th, 2008 12:59 am

    Nice collection - this will help me and new webmaster and web designer to create a robots.txt file as they want

