A robots.txt file is a way to tell search engines and other crawlers which parts of your website should be crawled and indexed.
WordPress uses a virtual robots.txt file by default.
This means that if you open your blog’s root directory, you will not find a file named robots.txt unless you created one manually.
However, you can view the virtual robots.txt that WordPress generates by adding “/robots.txt” to your blog’s URL (e.g. yourblog.com/robots.txt).

The virtual robots file includes the following 2 lines (unless it is modified by some kind of plugin or by your blog’s privacy settings):
User-agent: *
Disallow:

These lines tell all crawlers (called user agents) that every page and directory on your site may be crawled and indexed, including your admin pages (such as yourblog.com/wp-admin/).
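
To see how permissive these two lines are, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name is_allowed and the "Googlebot" agent string are my own choices for illustration, and the standard-library parser only approximates how real search engines match rules (it does plain prefix matching).

```python
from urllib.robotparser import RobotFileParser

# The two lines of the default virtual robots.txt described above.
DEFAULT_RULES = """\
User-agent: *
Disallow:
"""

def is_allowed(rules: str, url: str, agent: str = "Googlebot") -> bool:
    """Hypothetical helper: parse robots.txt text and check one URL."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)

# An empty Disallow blocks nothing, so even the admin area is open.
print(is_allowed(DEFAULT_RULES, "http://yourblog.com/wp-admin/"))

# Adding a single rule closes it.
TWEAKED_RULES = "User-agent: *\nDisallow: /wp-admin\n"
print(is_allowed(TWEAKED_RULES, "http://yourblog.com/wp-admin/"))
```

The first call should report the admin URL as allowed and the second as blocked, which is the behaviour the tweaks below are meant to achieve.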

As you may have guessed, indexing admin pages in search engines is not recommended, so you should definitely do some tweaking to these settings in order to improve SEO and prevent irrelevant pages from being indexed.
I recommend adding the following rules:

  • If you use both categories and tags in your blog, do not index both (reason: duplicate content).
  • Prevent indexing of search results, if your blog has a search feature (reason: duplicate content).
  • Consider preventing indexing of author and date archives, depending on whether you index categories and tags (reason: duplicate content).
  • Prevent indexing of all admin pages.
  • Advanced: consider preventing indexing of URLs that include arguments, depending on whether you use SEF permalinks (example: yourblog.com?s=q).

There are a couple of options to set these rules:

  • Install a plugin like Yoast’s Robots meta.
    This plugin adds meta tags to the head section of your pages and tells the search engine whether or not to index them.
    It also allows you to control search engine indexing for individual posts or pages.
  • Create a robots.txt file. This is very simple: you can write it in a plain-text editor such as Notepad and save the file as robots.txt.
    Alternatively, you can generate this file using an online robots.txt generator.
    Once you have finished, you should upload the file to your blog’s root directory.
    I use the following rules in my robots.txt file. They cover the steps above plus a few more.
    Note that I block search engines from crawling my category pages because I decided to use tags in this blog.

    User-agent: *
    Disallow: /cgi-bin
    Disallow: /wp-admin
    Disallow: /wp-includes
    Disallow: /wp-content/plugins
    Disallow: /wp-content/cache
    Disallow: /wp-content/themes
    Disallow: /wp-login.php
    Disallow: /*wp-login.php*
    Disallow: /trackback
    Disallow: /feed
    Disallow: /comments
    Disallow: /author
    Disallow: /contact/
    Disallow: */trackback
    Disallow: */feed
    Disallow: */comments
    Disallow: /z/j/
    Disallow: /z/c/
    Disallow: /stats/
    Disallow: /dh_
    Disallow: /category/*
    Disallow: /category/
    Disallow: /login/
    Disallow: /wget/
    Disallow: /httpd/
    Disallow: /i/
    Disallow: /f/
    Disallow: /t/
    Disallow: /c/
    Disallow: /j/
    Disallow: /*.php$
    Disallow: /*?*
    Disallow: /*.js$
    Disallow: /*.inc$
    Disallow: /*.css$
    Disallow: /*.gz$
    Disallow: /*.wmv$
    Disallow: /*.cgi$
    Disallow: /*.xhtml$
    Disallow: /*?*
    Disallow: /*?
    Allow: /wp-content/uploads
    
    # alexa archiver
    User-agent: ia_archiver
    Disallow: /
    
    # disable duggmirror by Digg
    User-agent: duggmirror
    Disallow: /
    
    # allow google image bot to search all images
    User-agent: Googlebot-Image
    Disallow: /wp-includes/
    Allow: /*
    
    # allow adsense bot on entire site
    User-agent: Mediapartners-Google*
    Disallow:
    Allow: /*
    

Once you have set a robots.txt file for your blog, you can test it to see if it does what it should (blocking certain pages).
To test it, you can use the Crawler Access tool in Google Webmaster Tools:

  • In GWT, go to Site Configuration -> Crawler Access.
  • On this page, make sure the robots.txt content shown in the text area was downloaded recently by Google and reflects the most recent changes you have made.
  • In the URLs box, type different URLs to test against (for example, yourblog.com/wp-admin/) and click on the Test button.
    The result displays something like “Blocked by line 3: Disallow: /wp-admin”.
    If it doesn’t, you missed something when creating the file.
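
If you want a quick offline check before uploading the file, the same idea can be sketched in a few lines of Python. This is not how Google's tool works internally; it is a deliberately simplified matcher of my own (the function name find_blocking_rule is hypothetical) that reads only the "User-agent: *" group, treats each Disallow value as a plain path prefix, and ignores Allow lines and the * and $ wildcards.

```python
from urllib.parse import urlparse

def find_blocking_rule(robots_txt: str, url: str):
    """Return (line_number, rule_text) of the first matching Disallow, or None.

    Simplified on purpose: only "User-agent: *" groups are read, rules are
    plain prefixes, and Allow lines and wildcards are not interpreted.
    The URL must include a scheme (http://...) so its path can be extracted.
    """
    path = urlparse(url).path or "/"
    in_star_group = False
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            in_star_group = (value == "*")
        elif field == "disallow" and in_star_group and value:
            if path.startswith(value):
                return number, line
    return None

SAMPLE = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
"""

hit = find_blocking_rule(SAMPLE, "http://yourblog.com/wp-admin/")
if hit:
    print(f"Blocked by line {hit[0]}: {hit[1]}")
else:
    print("Allowed")
```

Because wildcard rules such as /*.php$ are skipped by this sketch, treat an "Allowed" answer as provisional and confirm it with the Crawler Access tool.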

Another useful way to test the effectiveness of your robots.txt file is to use Robots.txt Analyzer.
