Robots.txt Tutorial


Robots are programs that automatically crawl the Web and retrieve documents. Web browsers like Internet Explorer or Firefox, by contrast, are operated by humans and don’t automatically retrieve text from referenced documents. Robots are most often referred to as crawlers, bots, or spiders. They visit sites by requesting documents from them. Search engines like Google, Yahoo! and MSN Search employ robots to crawl web documents so that those documents can be indexed and returned as search results.

Robots decide to visit a site based on a historical list of URLs, especially of documents with many links elsewhere. A directory or any web page that lists external links is a candidate for a robot visit. Most search engines allow you to submit URLs manually, which will then be queued and visited by the robot. Robots select URLs to visit and parse the retrieved documents as a source of new URLs. Most robots (the benevolent ones) routinely check for a special file called “robots.txt”, which can be installed by the server administrator of any web site. A webmaster may have reasons to exclude a robot from visiting a site. One very common reason is the large amount of bandwidth that robots eat up. A webmaster may also want robots to stay away from sensitive information, images or other files.
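A well-behaved robot looks for this file at one fixed place: the path /robots.txt at the root of the host it is about to visit. The following is a minimal sketch of that lookup in Python, not any particular crawler’s code. The function name robots_txt_url and the example.com addresses are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the URL where a robot would look for robots.txt for page_url.

    Hypothetical helper: keeps the scheme and host (including any port),
    replaces the path with /robots.txt, and drops the query and fragment.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Each host, including a subdomain, gets its own robots.txt location.
print(robots_txt_url("http://www.example.com/articles/page.html?id=7"))
print(robots_txt_url("http://blog.example.com/2009/05/post"))
```

Because the host name is part of the result, a subdomain such as blog.example.com is treated as a separate site with its own robots.txt file.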

Robots.txt Exclusion

To keep all robots away from your entire site, put these two lines into the /robots.txt file that lives in the root directory of the server:

User-agent: *
Disallow: /

But rarely does a webmaster want to exclude robots from visiting an entire site. Webmasters can write a structured text file instructing robots to stay away from certain areas of the server. Webmasters can even choose which robots to allow or disallow. Below is an example of how an exclusion may be written inside a robots.txt file:

# /robots.txt file for http://www.google.com

User-agent: Googlebot
Disallow:

User-agent: sillycrawler
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /cgi-bin

The first line, starting with ‘#’, is a comment.

The first example specifies that the robot called “Googlebot” is allowed to go anywhere.

The second example indicates that the robot called “sillycrawler” has all relative URLs starting with ‘/’ disallowed. Because all relative URLs on a server start with ‘/’, this means the entire site is disallowed.

The third example indicates that all other robots should not visit URLs starting with /tmp or /cgi-bin. The “*” is a special token that refers to “any other User-agent”; under the original standard, wildcard patterns or regular expressions cannot be used in either User-agent or Disallow lines.
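To see how a robot might apply these rules, here is a small sketch using the urllib.robotparser module from Python’s standard library. It parses the example file from a string instead of downloading it. The helper build_parser, the agent name “somebot” and the example.com URLs are assumptions made for illustration, and real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from this tutorial, held in a string for testing.
ROBOTS_TXT = """\
# /robots.txt file for http://www.google.com

User-agent: Googlebot
Disallow:

User-agent: sillycrawler
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /cgi-bin
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Hypothetical helper: load robots.txt rules from text, not the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
SITE = "http://www.example.com"  # assumed site, for illustration only

# "somebot" is a made-up robot that falls under the "*" record.
checks = [
    ("Googlebot", "/tmp/cache.html"),
    ("sillycrawler", "/index.html"),
    ("somebot", "/index.html"),
    ("somebot", "/tmp/cache.html"),
    ("somebot", "/cgi-bin/search"),
    ("somebot", "/tmpfile.html"),
]
for agent, path in checks:
    allowed = parser.can_fetch(agent, SITE + path)
    print(f"{agent:13} {path:18} {'allowed' if allowed else 'disallowed'}")
```

With this parser, Googlebot is allowed everywhere, sillycrawler is disallowed everywhere, and any other robot is kept out of paths beginning with /tmp or /cgi-bin. Note that /tmpfile.html is also disallowed for other robots, because a Disallow value matches any URL path that starts with it.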

See the Standard for Robots Exclusion for details or see an example of a robots.txt file at WebmasterWorld.com.

Source: Robots.txt FAQ by Martijn Koster.

“Robots.txt Tutorial” has 10 Comments

  1. Emory Rowland Says:

    This is some good info about Googlebot:

    http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40364

  2. Dsouza jhn Says:

    can you provide some more information on google robot.

  3. Ralph Jones Says:

    Very interesting post on Robots.txt. I am a bit new to this so I book marked this to help me further. Indeed sometimes there could be pages that you would want the Google bot or any other to visit temporarily and this is indeed helpful.

  4. paul Says:

    nice bit of info. i had no idea you could control the agents visiting your site! pity neo didnt know about that in the matrix! would have made his life a whole lot easier! ;-)

  5. Shiva Says:

    Hello,

    I can understand the robots.txt file is used by major search engines. Does the spam search engine follow the robots.txt rules.

    let me know !

  6. Emory Rowland Says:

    It’s voluntary so I’d expect them not to follow the robots.txt rules.

  7. Md Azhar Says:

    Hello

    Thanks for sharing info, Can we block the sub domain using robots.txt ?

    Thanks

  8. Emory Rowland Says:

    Sure, you would just put these lines in a separate robots.txt file in the subdomain root directory.

    User-agent: *
    Disallow: /

  9. SEO Services Says:

    I Want to exclude (block) all pages of my website instead of “home (index)” page so what will best and simple technique (coding), can use in robot.txt file?……. Thanks in advance

  10. Emory Rowland Says:

    That’s a tough one. If it were a small site, my first inclination would be to block the specific directories and file names except for the index. I bet someone has a better answer though.
