Every website can use the robots.txt file to give Google's bots instructions on how to crawl the site. Optimally configuring robots.txt is a priority for controlling crawl budget, preventing certain URLs of the site from being crawled, and providing a sitemap with the list of URLs that should be indexed and ranked as a priority.
Let's look at what the robots.txt file is, how it is created and what elements it consists of.
What is the robots.txt file?
According to Google's own definition, robots.txt is a file located in the root directory of a website that tells crawlers which parts of the site they should not access. Because it is a plain text file, it can be edited and configured with a simple text editor such as Windows Notepad.
The robots.txt file uses the robots exclusion standard, a protocol based on a series of commands that indicate which sections of the website may be accessed by each type of crawler (mobile or desktop, for example).
Each website keeps its robots.txt file in the root of its directory, and it is the first place Google's crawlers and other search engines check for instructions before examining the site.
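As a simple illustration (example.com is just an assumed placeholder domain), for a site served at https://www.example.com, crawlers will request the file at:

https://www.example.com/robots.txt

If the file is missing or empty, crawlers assume the whole site may be crawled.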
How does the robots.txt file work?
The first thing to understand about the robots.txt file is that its rules are merely directives: search engines can ignore them, although well-behaved crawlers normally follow them, which also helps them optimize their crawl time.
However, Google's bots do respect the robots.txt file, so its optimization is essential for SEO (other search engines such as Ask, Yandex or AllTheWeb do not always follow its crawling directives).
When commands are included to indicate that a part of the website should not be crawled, Google's bots consult the robots.txt directives before crawling the site and skip the blocked URLs.
How do we create the robots.txt file?
The robots.txt file is a text file that can be edited with a simple text editor (Notepad for Windows, gedit for Linux, or TextEdit for Mac).
The file consists of a series of commands and rules to offer information about the site to search engines. Some basic rules that should be known before creating this file are:
- There can only be one robots.txt file for each website.
- The file name must be exactly “robots.txt” so that it can be identified and read by search engines.
- The file must be located in the root or main directory of the hosting where the web page is hosted.
- If you have active subdomains, you can also use a robots.txt in each of them.
- The file must be encoded in UTF-8. If another encoding is used, Google may ignore some characters and the rules included in the file may no longer take effect.
- The rules included in the file are case-sensitive, so disallow: /file.asp is not the same as disallow: /FILE.asp.
- Comments can be included in the file using the “#” character.
The robots.txt file is divided into groups. Each group contains a series of rules or directives, one per line, and begins with a User-agent command indicating the type of robot the group is intended for.
Within each group, you define which crawler the guidelines apply to and which folders or files it may or may not access.
Crawlers read the file sequentially, starting at the top, as the sketch below illustrates.
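As a hedged sketch (the domain and directory names are assumptions for illustration, not recommendations), a file with two groups and a comment could look like this:

# Keep every crawler out of an assumed /private/ directory
User-agent: *
Disallow: /private/

# Let Googlebot crawl the whole site
User-agent: Googlebot
Allow: /

Each crawler applies the group whose User-agent line matches it, so in this sketch Googlebot may crawl everything while all other bots must skip /private/.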
Robots.txt file elements
Commands
The different commands or directives that can be included in robots.txt are:
- user-agent. This command must be included at least once in each group of the file and must contain the name of the crawler to which the rules apply. It appears at the beginning of the group, before the other commands. The full list of Google user-agents can be consulted in Google's documentation.
- disallow. Indicates a page or directory that should not be crawled by the crawler named in the user-agent line. If the rule refers to a specific page, its full path as shown in the browser must be included.
- allow. Performs the opposite function of disallow, telling the user-agent that the directory or URL may be crawled; it is typically used to override a disallow rule for a specific subdirectory or page.
- sitemap. Indicates the URL where the sitemap is located (the complete, absolute URL must be given). The sitemap contains the list of URLs to be indexed, so including it in robots.txt makes the crawlers' job easier (see the combined sketch after this list).
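A short combined sketch of these directives (the domain and paths are illustrative assumptions) might be:

User-agent: *
Disallow: /admin/
Allow: /admin/public-info.html
Sitemap: https://www.example.com/sitemap.xml

Here the allow rule opens up a single page inside a directory that is otherwise blocked, and the sitemap line points crawlers to the complete, absolute URL of the sitemap.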
Specific rules
1. Block crawling of the entire website
User-agent: *
Disallow: /
2. Allow access to a single search engine
User-agent: Googlebot-news
Allow: /

User-agent: *
Disallow: /
The first group indicates that Google News specifically is allowed to crawl the content of the site, while the second group blocks crawling for the rest of the bots.
3. Allow access to all engines except one
User-agent: Googlebot-Image
Disallow: /

User-agent: *
Allow: /
4. Prevent crawling of a directory
To prevent bots from crawling an entire directory and the URLs it contains:
User-agent: *
Disallow: /calendar/
Disallow: /junk/
5. Prevent crawling of a single URL
User-agent: *
Disallow: /not-crawling.html
6. Block the use of an image
To prevent Googlebot-Image from showing a certain image in Google's results pages:
User-agent: Googlebot-Image
Disallow: /images/cat.jpg
7. Prevent crawling of a file type
To block crawling of a particular file type, such as JPG images (the * wildcard matches any sequence of characters and the $ sign anchors the rule to the end of the URL):
User-agent: Googlebot
Disallow: /*.jpg$
Testing the robots.txt file in Google
Google provides a free, dedicated tool to test robots.txt files and confirm that they are properly configured. This tool, known as the robots.txt Tester, is very simple to use, as the following steps show:
- Access the official Google robots.txt Tester page.
- In the “select a property” option, choose the website you want to test. If the site is not linked to Google Search Console or Google Analytics, it will be necessary to do so first.
- After selecting the property to analyze, the content of its robots.txt file is displayed in a text editor, along with information about it.
- At the bottom, two icons show the errors found (in red) and warnings about possible problems or conflicts (in orange).
- It is possible to edit the robots.txt file right in the tool and then apply the changes by downloading the edited file and uploading it to the root directory to replace the old one.
- Another notable function of the tool is the ability to check, for each of Google's bots (Googlebot, Googlebot-Image, Googlebot-News, Googlebot-Video, Googlebot-Mobile, Mediapartners-Google and AdsBot-Google), whether the different URLs of the site are allowed or blocked.
This tester has some limitations, such as the inability to perform access or blocking checks at the domain level. It is also specific to the behavior of Google's bots, so it cannot be used to check how other crawlers interpret the file; a quick local check such as the one sketched below can complement it for other user agents.
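As a complement (a minimal sketch only, not part of Google's tool; the domain and paths are assumptions), Python's standard-library urllib.robotparser can test how a given user agent would be treated by a live robots.txt file:

# Quick local robots.txt check using Python's standard library.
# The URL and paths below are illustrative assumptions.
from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")  # assumed domain
parser.read()  # downloads and parses the file

# Ask whether a given user agent may fetch a given URL
print(parser.can_fetch("Googlebot", "https://www.example.com/calendar/"))
print(parser.can_fetch("*", "https://www.example.com/blog/post.html"))

This only reflects the standard robots exclusion rules as implemented by Python, so results may differ slightly from how a specific crawler interprets the file.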
Knowing how to configure the robots.txt file is a priority for any SEO expert, as it allows you to tell the different Google bots how they should behave when crawling, indexing and ranking the URLs of a website.
In this file, a series of rules can be applied to prevent a URL or domain from being crawled or to indicate that it may be crawled. Although Google's bots follow the directives of this file in most cases, other search engines only use them as a reference and may ignore them, following their own crawling criteria.