Robots.txt Tutorial

Robots are programs that automatically crawl the Web and retrieve documents. By contrast, web browsers like Chrome or Firefox are operated by humans and don’t automatically retrieve text from referenced documents. Robots are most often referred to as crawlers, bots, or spiders. Their job is to visit sites and request pages from them, that is, to “crawl” them. Search engines index the web pages that robots crawl and serve them to users as search results.

Search engine robots find sites to crawl based on a historical list of URLs. Any site or page indexed by a search engine is a candidate for robots to crawl. If a page links out to other pages, bots will try to follow those links.
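To make the link-following idea concrete, here is a minimal sketch in Python. It is not how any particular search engine works; the PAGES dictionary, the LinkExtractor class, and the crawl function are all made up for illustration, and the "web" is held in memory so the example runs without a network connection.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical in-memory "web" so the sketch needs no network access.
PAGES = {
    "http://example.com/": '<a href="/about.html">About</a> <a href="/blog/">Blog</a>',
    "http://example.com/about.html": '<a href="/">Home</a>',
    "http://example.com/blog/": '<a href="post1.html">Post</a>',
    "http://example.com/blog/post1.html": "No links here.",
}


def crawl(seed_urls):
    """Visit pages breadth-first, starting from a list of already-known URLs."""
    frontier = deque(seed_urls)   # URLs waiting to be visited
    seen = set(seed_urls)         # URLs already queued, to avoid loops
    visited = []
    while frontier:
        url = frontier.popleft()
        html = PAGES.get(url)
        if html is None:
            continue  # a real crawler would issue an HTTP request here
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


print(crawl(["http://example.com/"]))
```

Starting from a single known URL, the sketch ends up visiting all four pages because each one is reachable through links, which is the same reason a page you never submitted anywhere can still get crawled.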

Most robots (the benevolent ones, at least) routinely check for a special text file called “robots.txt”, which the server administrator of any web site can install. There may be reasons you’d want to block or “exclude” a robot from visiting your site. One very common reason is the large amount of bandwidth that unbridled robots can eat up. There may also be files that you don’t want crawled and indexed by search engines for the world to see, perhaps that image file of that wild night that you and your buddies tried cat juggling.

Robots.txt Exclusion Examples

To prevent robots from visiting your site, put these two lines into the /robots.txt file that lives in the root directory of the server:

User-agent: *
Disallow: /
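If you want to see what those two lines do, one option is Python's standard-library urllib.robotparser module. The sketch below is only an illustration: the robots_url helper, the "AnyBot" name, and the example.com URLs are assumptions, and other crawlers may interpret edge cases differently from Python's parser.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url):
    """Return the robots.txt location for the host that serves page_url."""
    parts = urlsplit(page_url)
    # robots.txt always sits at the root of the host, whatever the page path is.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The two-line "keep everyone out" file.
BLOCK_ALL = ["User-agent: *", "Disallow: /"]

parser = RobotFileParser()
parser.parse(BLOCK_ALL)

print(robots_url("http://www.example.com/photos/party.jpg"))
print(parser.can_fetch("AnyBot", "http://www.example.com/"))
print(parser.can_fetch("AnyBot", "http://www.example.com/photos/party.jpg"))
```

The first line printed shows where a robot would look for the file; the two checks both come back False, because every path on the site starts with "/".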

But you don’t always want to exclude bots from visiting an entire site. You can write a structured text file instructing robots to stay away from certain areas of the server. You can even choose which robots to allow or disallow. Here is an example of how an exclusion may be written inside your robots.txt file:

# robots.txt file for my site

User-agent: Googlebot
Disallow:

User-agent: hungrycrawler
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /cgi-bin

The first line starts with ‘#’, which marks it as a comment.

The next two lines specify that the robot called “Googlebot” is allowed to go anywhere: the Disallow value is empty (there is not even a slash), so nothing is excluded.

The next two lines instruct a robot called “hungrycrawler” that it has been completely disallowed.

The last group tells all robots to refrain from visiting URLs starting with /tmp or /cgi-bin. The “*” is a special token that refers to “any other User-agent”.
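The walk-through above can be checked with a short Python sketch that feeds the same example file to the standard-library urllib.robotparser module. The bot name "OtherBot" and the paths being tested are made up for illustration, and Python's parser is just one interpretation of the rules, so treat this as a rough check rather than a statement of how any given search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt file from above, unchanged.
EXAMPLE = """\
# robots.txt file for my site

User-agent: Googlebot
Disallow:

User-agent: hungrycrawler
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /cgi-bin
"""

rules = RobotFileParser()
rules.parse(EXAMPLE.splitlines())


def allowed(agent, path):
    """Ask whether the named robot may fetch the given path."""
    return rules.can_fetch(agent, path)


# Googlebot has an empty Disallow, so even /tmp is open to it.
print("Googlebot     /tmp/notes.html ", allowed("Googlebot", "/tmp/notes.html"))
# hungrycrawler is shut out of everything.
print("hungrycrawler /index.html     ", allowed("hungrycrawler", "/index.html"))
# Any other robot falls under the "*" group.
print("OtherBot      /index.html     ", allowed("OtherBot", "/index.html"))
print("OtherBot      /tmp/notes.html ", allowed("OtherBot", "/tmp/notes.html"))
print("OtherBot      /cgi-bin/form.pl", allowed("OtherBot", "/cgi-bin/form.pl"))
```

Note that the Disallow values are matched as prefixes, so "Disallow: /tmp" also covers a path such as /tmpfile, not just the /tmp directory.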

Check out the robots.txt Testing Tool in Google Search Console to see how Googlebot is crawling your site.

Source: Robots.txt FAQ by Martijn Koster.

Emory Rowland

I'm editor and keeper of the flame at Clickfire, fanatical social media blogger and builder of Internet things from way back. My love for social media and success with organic search led me to start my own consulting company. Apart from the Internet, I could be considered pretty worthless.

28 comments

  1. Emory Rowland
  2. can you provide some more information on google robot.

  3. Ralph Jones

    Very interesting post on Robots.txt. I am a bit new to this so I book marked this to help me further. Indeed sometimes there could be pages that you would want the Google bot or any other to visit temporarily and this is indeed helpful.

  4. paul

    nice bit of info. i had no idea you could control the agents visiting your site! pity neo didnt know about that in the matrix! would have made his life a whole lot easier! ;-)

  5. Hello,

    I can understand the robots.txt file is used by major search engines. Does the spam search engine follow the robots.txt rules.

    let me know !

    • Emory Rowland

      It’s voluntary so I’d expect them not to follow the robots.txt rules.

  6. Hello

    Thanks for sharing info, Can we block the sub domain using robots.txt ?

    Thanks

    • Emory Rowland

      Sure, you would just put these lines in a separate robots.txt file in the subdomain root directory.

      User-agent: *
      Disallow: /

  7. SEO Services

    I Want to exclude (block) all pages of my website instead of “home (index)” page so what will best and simple technique (coding), can use in robot.txt file?……. Thanks in advance

    • Emory Rowland

      That’s a tough one. If it were a small site, my first inclination would be to block the specific directories and file names except for the index. I bet someone has a better answer though.

  8. Hello..

    Shall any one help how should I block the url’s in bing webmaster tools.. I have added removal url’s in robots.txt working fine in google webmaster tools

    Thanks

  9. Emory Rowland

    Hi Ravi, Have you looked into the Bing Webmaster Tools block URLs feature?

  10. Thank you Mr. Emory Rowland
    I got an idea
    I used to done black hat SEO for my website like keywords.mywebsite.com now I have complete awareness of white hat seo I made many changes in my website, I have removed sub domains from my cPanel but still in google showing the sub domains when I type site:www.mywebsite.com could you please help me how should I block them.

    Thanks in Advance.

    • Emory Rowland

      Hi Ravi, if you want to block subdomains, you have to place the robots.txt file in the root of each subdomain that you want to block. This would direct all crawlers to stay out:

      User-agent: *
      Disallow: /

      But, since you have already deleted your subdomains in cPanel, that’s a problem. You could add them back or just wait for them to disappear eventually.

  11. Thank you very much Mr. Emory Rowland

  12. Dear Emory Rowland,

    Do you have any Idea
    Google is not taking my page description in SERP results it is taking my body content in middle of the paragraph.

    Even the title also I have used something and google is showing something.

    Previously my website was in the top position in search results I am not doing any updates in my website..

    Could you please give me an idea about my problem?

    • Emory Rowland

      Hi Ravi, Sometimes Google will display their own “abstract” or Open Directory title or description if yours is missing or they don’t like the one you have. Some reasons could be that you are including only generic keywords in the title tag – you may want to include your site’s brand or domain name keywords, especially on the home page – or that the title is missing. If you make the change, let us know how it works please.

      • Dear Emory Rowland
        I have made the changes in my website according to your post even I have use my brand name in home page
        I will let you know once Google crawl my website

        Thank you

        • Emory Rowland

          Okay Ravi, Standing by.

          • Dear Emory Rowland,
            Thank you!!
            I am getting the proper output slowly, my web page is going to top in SERP results slowly….
            I got an email today that “The Keywords Meta Tag is Still Used” from activesearchresults.com
            I may have to use meta keywords for good SERP Results ?

          • Hi Ravi, I understand that it’s only used by Bing and that’s to detect potential spam sites that stuff words in the tag. It won’t help you. I usually leave it out nowadays.

  13. Thanks for the info. i really appreciate

  14. amy

    thanks for sharing such an easy guide about robot.txt

  15. Great article , i don’t know about robot.txt file. Now i got clear cut idea. Thanks for your valuable information.

  16. very good robot.txt . I used in mohsen-chavoshi.ir

  17. Great Article and great feature given by google as well so that we can block the URLs or folder that we don’t want to show google. thanks a lot.

  18. riya

    thanks for sharing such an easy guide about robot.txt

  19. This helps me alot. I wanted to block certain areas of my website. This article helps me in doing just that.
