Learn about robots.txt file

The robots exclusion protocol (REP), or robots.txt, is a text file webmasters create to instruct robots (typically search engine crawlers) how to crawl and index pages on their website.
Robots.txt is a text (not HTML) file you put on your site to tell search robots which pages you would like them not to visit. Robots.txt is by no
means mandatory for search engines, but reputable search engines generally obey what they are asked not to do. It is important to clarify that robots.txt
is not a way of preventing search engines from accessing your site (i.e. it is not a firewall or a kind of password protection). Putting up a
robots.txt file is something like putting a note saying “Please, do not enter” on an unlocked door – you cannot prevent thieves from
coming in, but the good guys will not open the door and enter. That is why we say that if you have really sensitive data, it is too naïve to
rely on robots.txt to protect it from being indexed and displayed in search results.

When a search engine crawls (visits) your website, the first thing it looks for is your robots.txt file. This file tells search engines what they should and should not index (save and make available as search results to the public). It also may indicate the location of your XML sitemap.
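A Sitemap directive in robots.txt can be read programmatically with Python's standard urllib.robotparser module (the site_maps() method requires Python 3.8+). The example.com URLs below are placeholders:

```python
import urllib.robotparser

# A minimal robots.txt that allows everything and declares a sitemap.
rules = """\
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# site_maps() returns the declared sitemap URLs, or None if there are none.
print(rp.site_maps())
```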

A robots.txt file consists of records made up of two kinds of fields: a line with a user-agent name (identifying the search engine crawler the record applies to), followed by one or several lines starting
with a directive such as Disallow or Allow.

  • How to create a robots.txt file

You will need to create it in the top-level directory of your web server.

When a robot looks for the “/robots.txt” file for a URL, it strips the path component from the URL (everything from the first single slash) and puts “/robots.txt” in its place.

For example, for “http://www.example.com/shop/index.html”, it will remove “/shop/index.html”, replace it with “/robots.txt”, and end up with “http://www.example.com/robots.txt”.

So, as a web site owner you need to put it in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site’s main “index.html” welcome page. Where exactly that is, and how to put the file there, depends on your web server software.
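This URL derivation can be sketched with Python's standard urllib.parse module; the function name robots_url is just an illustrative choice:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Derive the robots.txt URL a crawler would fetch for a given page."""
    parts = urlsplit(page_url)
    # Drop the path, query, and fragment; keep only the scheme and host.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.example.com/shop/index.html"))
# http://www.example.com/robots.txt
```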

Remember to use all lower case for the filename: “robots.txt”, not “Robots.TXT”.

You can simply create a blank file and name it robots.txt. This will prevent “file not found” errors in your server logs when crawlers request it, and it allows all search engines to crawl and index anything they want.

Here’s a simple robots.txt file:

User-agent: *
Allow: /wp-content/uploads/
Disallow: /

1. The first line explains which agent (crawler) the rule applies to. In this case, User-agent: * means the rule applies to every crawler.

2. The subsequent lines set which paths can (or cannot) be indexed. Allow: /wp-content/uploads/ allows crawling of your uploads folder (images), and Disallow: / means no other file or page should be indexed aside from what has been allowed previously. You can have multiple rules for a given crawler.

3. The rules for different crawlers can be listed in sequence, in the same file.
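The effect of the sample file above can be checked with Python's standard urllib.robotparser module. The example.com URLs are placeholders:

```python
import urllib.robotparser

# The sample robots.txt from above.
rules = """\
User-agent: *
Allow: /wp-content/uploads/
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The uploads folder is explicitly allowed.
print(rp.can_fetch("*", "http://www.example.com/wp-content/uploads/logo.png"))  # True
# Everything else falls under Disallow: /
print(rp.can_fetch("*", "http://www.example.com/about/"))  # False
```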

  • Examples of usage

Prevent the whole site from being indexed by all web crawlers:

User-agent: *
Disallow: /

Allow all web crawlers to index the whole site:

User-agent: *
Disallow:


Prevent only certain directories from being indexed:

User-agent: *
Disallow: /cgi-bin/


Prevent the site from being indexed by a specific web crawler:

User-agent: Bot1
Disallow: /
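A per-crawler rule like this can also be verified with urllib.robotparser. Bot1 and OtherBot are hypothetical user-agent names, and example.com is a placeholder:

```python
import urllib.robotparser

# Rules that block only the crawler named Bot1.
rules = """\
User-agent: Bot1
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Bot1 is blocked from the whole site.
print(rp.can_fetch("Bot1", "http://www.example.com/page.html"))      # False
# A crawler not named in any record is allowed by default.
print(rp.can_fetch("OtherBot", "http://www.example.com/page.html"))  # True
```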

  • Robots.txt for WordPress
If you are running WordPress, you want search engines to crawl and index your posts and pages, but not your core WP files and directories. You also want to make sure that feeds and trackbacks aren’t included in the search results. It’s also good practice to declare a sitemap. So, if you haven’t yet created a real robots.txt, create one with any text editor and upload it to the root directory of your server via FTP.
Blocking main WordPress Directories
There are three standard directories in every WordPress installation – wp-content, wp-admin, and wp-includes – that don’t need to be indexed.

Don’t choose to disallow the whole wp-content folder though, as it contains an ‘uploads’ subfolder with your site’s media files that you don’t want to be blocked. That’s why you need to proceed as follows:

User-Agent: *
# disallow all files in these directories
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
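You can confirm that these WordPress rules block the core directories while leaving the uploads folder crawlable, again using urllib.robotparser (example.com is a placeholder domain):

```python
import urllib.robotparser

# The WordPress rules from above.
rules = """\
User-Agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Core directories are blocked.
print(rp.can_fetch("*", "https://example.com/wp-admin/options.php"))          # False
# Media files under uploads remain crawlable.
print(rp.can_fetch("*", "https://example.com/wp-content/uploads/photo.jpg"))  # True
```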

  • Miscellaneous remarks
  • Don’t list all your files in the robots.txt file. Listing the files allows people to find files that you don’t want them to find.
  • Don’t block CSS, JavaScript, and other resource files by default. This prevents Googlebot from properly rendering the page and understanding that your site is mobile-optimized.
  • An incorrect robots.txt file can block Googlebot from indexing your page
  • Put your most specific directives first, and your more inclusive ones (with wildcards) last
