The robots exclusion protocol (REP), or robots.txt, is a text file webmasters create to instruct robots (typically search engine crawlers) how to crawl and index pages on their website.
Robots.txt is a plain text (not HTML) file you put on your site to tell search robots which pages you would like them not to visit. Robots.txt is by no means mandatory for search engines, but search engines generally obey what they are asked not to do. It is important to clarify that robots.txt is not a way of preventing search engines from crawling your site (i.e. it is not a firewall or a kind of password protection): putting up a robots.txt file is like putting a “Please do not enter” note on an unlocked door. You cannot prevent thieves from coming in, but the good guys will not open the door and enter. That is why, if you have really sensitive data, it is naïve to rely on robots.txt to protect it from being indexed and displayed in search results.
When a search engine crawls (visits) your website, the first thing it looks for is your robots.txt file. This file tells search engines what they should and should not index (save and make available as search results to the public). It also may indicate the location of your XML sitemap.
A robots.txt file consists of records made up of two kinds of lines: a line naming the user-agent (a search engine crawler) and one or more lines starting with a directive such as Disallow or Allow.
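For instance, a single record might look like this (the crawler name and path here are only illustrative):

```
User-agent: Googlebot
Disallow: /private/
```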
- How to create a robots.txt file
You will need to create it in the top-level directory of your web server.
When a robot looks for the “/robots.txt” file for a URL, it strips the path component from the URL (everything from the first single slash) and puts “/robots.txt” in its place.
For example, for “http://www.example.com/shop/index.html”, it will remove “/shop/index.html”, replace it with “/robots.txt”, and end up with “http://www.example.com/robots.txt”.
So, as a web site owner you need to put it in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site’s main “index.html” welcome page. Where exactly that is, and how to put the file there, depends on your web server software.
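The URL rewriting described above can be sketched in a few lines of Python; this is a hypothetical helper for illustration, not any crawler’s actual code:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(url: str) -> str:
    """Replace the path of a URL with /robots.txt, keeping scheme and host."""
    parts = urlsplit(url)
    # Drop the path, query, and fragment; keep only scheme://host/robots.txt
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.example.com/shop/index.html"))
# -> http://www.example.com/robots.txt
```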
Remember to use all lower case for the filename: “robots.txt”, not “Robots.TXT”.
You can also simply create a blank file and name it robots.txt. An empty robots.txt avoids 404 errors in your server logs when crawlers request it, and it places no restrictions, so all search engines may crawl and index anything they want.
Here’s a simple example:

User-agent: *
Allow: /wp-content/uploads/
Disallow: /
1. The first line explains which agent (crawler) the rule applies to. In this case,
User-agent: * means the rule applies to every crawler.
2. The subsequent lines set what paths can (or cannot) be indexed.
Allow: /wp-content/uploads/ allows crawling through your uploads folder (images), and
Disallow: / means no file or page should be indexed aside from what’s been allowed previously. You can have multiple rules for a given crawler.
3. The rules for different crawlers can be listed in sequence, in the same file.
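For instance, two records can sit in one file, separated by a blank line (the crawler name and paths here are only illustrative):

```
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /tmp/
```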
- Examples of usage
Prevent the whole site from indexation by all web crawlers:
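The standard pattern for this case:

```
User-agent: *
Disallow: /
```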
Allow all web crawlers to index the whole site:
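An empty Disallow line places no restrictions:

```
User-agent: *
Disallow:
```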
Prevent only several directories from indexation:
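For example (the directory names are only illustrative):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
```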
Prevent site’s indexation by a specific web crawler:
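For example, to block only Googlebot (used here as an illustration) while leaving other crawlers unaffected:

```
User-agent: Googlebot
Disallow: /
```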
- Robots.txt for WordPress
Don’t disallow the whole wp-content folder, though, as it contains an ‘uploads’ subfolder with your site’s media files that you don’t want blocked. That’s why you need to proceed as follows:
# disallow all files in these directories (typical WordPress defaults; adjust to your install)
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Allow: /wp-content/uploads/
- Miscellaneous remarks
- Don’t list sensitive files in your robots.txt file. The file is publicly readable, so listing paths advertises exactly the files you don’t want people to find.
- An incorrect robots.txt file can block Googlebot from indexing your page.
- Put your most specific directives first, and your more inclusive ones (with wildcards) last.