Using a robots.txt file with Disallow and Allow rules, you can control which search engine crawlers may crawl your site’s content. Robots.txt is a convention (the Robots Exclusion Protocol) that asks cooperating web spiders and other web robots to stay out of all or part of a website. The robots.txt file is commonly used to block search engine spiders, also called bots. Inside robots.txt, bots are identified as “user-agents”.
Let us look at how to use robots.txt to manage crawler access to your content. Robots.txt exclusion is the process of disallowing bots from accessing your content. The robots.txt file must be saved as a plain text file, in ASCII or UTF-8 encoding, at the root of your site (for example http://www.yoursite.com/robots.txt).
You can achieve similar results using the “robots” meta tag. The “robots” meta tag is discussed in “ROBOTS tag attribute – META NAME="robots" and CONTENT values Explained”, a guide on how to use the robots tag.
Google Webmaster Central provides a tool to generate and validate robots.txt. The Google robots.txt generator can be used to create a robots.txt file for your website. Let us discuss how to use robots.txt files and Disallow and Allow rules.
Allow all search engine bots full access
To allow all search engine bots, do either of the following.
1. Create an empty robots.txt file. This allows all robots full access to your website and content.
2. Add a rule with an empty Disallow value to robots.txt.
The example robots.txt below shows how to allow all user agents.

    User-agent: *
    Disallow:
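To check that both options behave the same way, here is a minimal sketch using Python's standard `urllib.robotparser` module. The bot name `ExampleBot` and the helper `parse_rules` are made up for illustration, and a real crawler may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content given as a string (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Option 1: an empty robots.txt file.
empty_file = parse_rules("")
# Option 2: a single group with an empty Disallow value.
allow_all = parse_rules("User-agent: *\nDisallow:\n")

page = "http://www.yoursite.com/any/page.html"
for parser in (empty_file, allow_all):
    # Both variants should report that the page may be fetched.
    print(parser.can_fetch("ExampleBot", page))
```

Both calls print `True`, because neither file contains a rule that restricts any path.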
How to block certain search bots from accessing your content
For example, if you want to disallow only Googlebot, use the following rules. A bot follows the group that names it and ignores the “*” group.

    User-agent: *
    Allow: /

    User-agent: Googlebot
    Disallow: /
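The effect of per-bot groups can be sketched with Python's standard `urllib.robotparser`. The rules are embedded as a string here, and `ExampleBot` is a made-up name standing in for any bot other than Googlebot.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Allow: /

User-agent: Googlebot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

page = "http://www.yoursite.com/index.html"
# Googlebot has its own group, so the "*" group does not apply to it.
googlebot_allowed = parser.can_fetch("Googlebot", page)
# Any other bot falls back to the "*" group.
other_allowed = parser.can_fetch("ExampleBot", page)

print("Googlebot:", googlebot_allowed)
print("ExampleBot:", other_allowed)
```

Here Googlebot is refused while the other bot is allowed, which is the intended result of the rules above.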
How to Exclude all robots from the entire server
    User-agent: *
    Disallow: /
How to disallow all bots other than Google from accessing a directory?
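Here the /private/ directory is accessible only by Googlebot. All other bots can crawl the rest of the site but not /private/. Note that a “Disallow: /” line in the Googlebot group would block Googlebot from everything except /private/, so that group is left unrestricted.

    User-agent: *
    Disallow: /private/
    Allow: /

    User-agent: Googlebot
    Allow: /private/
    Allow: /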
How to allow all search engines to access only one file from a directory?
You can block all search engine bots from accessing a folder and at the same time allow access to one file in that folder. In the example below, all bots are disallowed from accessing the “private” directory but allowed to access the “file.html” file inside it.

    User-agent: *
    Disallow: /private/
    Allow: /private/file.html
    Allow: /
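The Allow line wins for file.html because major crawlers such as Googlebot resolve conflicts by the most specific (longest) matching path, which is also what RFC 9309 describes. The sketch below is a simplified illustration of that rule, not any crawler's actual code. The function `is_allowed` and the `(kind, prefix)` rule format are assumptions made for this example, and wildcards are not handled.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Longest-match evaluation of plain prefix rules.

    rules is a list of (kind, prefix) pairs, where kind is "allow" or
    "disallow". With no matching rule, the path is allowed.
    """
    best_len = -1
    best_allow = True
    for kind, prefix in rules:
        # An empty prefix matches nothing; skip rules that do not apply.
        if not prefix or not path.startswith(prefix):
            continue
        allow = kind == "allow"
        # A longer prefix is more specific; on a tie, prefer allow.
        if len(prefix) > best_len or (len(prefix) == best_len and allow):
            best_len = len(prefix)
            best_allow = allow
    return best_allow


RULES = [
    ("disallow", "/private/"),
    ("allow", "/private/file.html"),
    ("allow", "/"),
]

print(is_allowed("/private/file.html", RULES))   # True
print(is_allowed("/private/other.html", RULES))  # False
print(is_allowed("/index.html", RULES))          # True
```

Some older parsers apply the first matching rule in file order instead, so putting the Allow line before the Disallow line is the safer ordering if you need to support them.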
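Block all bots from accessing URLs with a specific name pattern
How do you block all bots from accessing files whose names end with “private.php”? See below. The “*” wildcard and the “$” end-of-URL anchor are pattern extensions honored by major crawlers such as Googlebot; without the trailing “$”, the rule would also match URLs that merely contain “private.php”.

    User-agent: *
    Disallow: /*private.php$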
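To see how such a pattern is matched against a URL path, here is a rough sketch that translates a robots.txt pattern into a regular expression. It assumes the common wildcard semantics, where `*` matches any run of characters and a trailing `$` anchors the end of the URL. The functions `pattern_to_regex` and `matches` and the sample pattern `/*private.php$` are illustrative only.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape literal parts and join them with ".*" for each "*" wildcard.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    # Patterns are matched from the start of the URL path.
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/*private.php$", "/admin/private.php"))  # True
print(matches("/*private.php$", "/private.php?x=1"))    # False
print(matches("/*private.php", "/private.php?x=1"))     # True
```

The last two lines show the difference the `$` makes: with the anchor, a URL with a query string after `private.php` no longer matches.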
Robots.txt and the sitemap file.
If you have created an XML-based Sitemap file for your site, you can add a reference to the location of your Sitemap file at the end of your robots.txt file.
For example, you can reference the compressed form of the sitemap as shown below.

    User-agent: *
    Disallow:
    Sitemap: http://www.yoursite.com/sitemap.xml.gz

Or reference the uncompressed sitemap.xml:

    User-agent: *
    Disallow:
    Sitemap: http://www.yoursite.com/sitemap.xml
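As a quick check that the Sitemap line is picked up, the sketch below reads it back with Python's standard `urllib.robotparser` (the `site_maps()` method needs Python 3.8 or later). The rules are embedded as a string for illustration.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow:
Sitemap: http://www.yoursite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# site_maps() returns a list of Sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```

This prints a list containing the single sitemap URL from the file.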