July 12 2022

robots.txt - The most common mistakes and how to avoid them

The robots.txt file tells search engines how to crawl your site. In this article we explain the most common mistakes and how to avoid them.

Every webmaster knows that there are parts of a website that should not be crawled or indexed. The robots.txt file lets you specify these sections and communicate them to search engine crawlers. In this article we show the most common errors that can occur when creating a robots.txt file, how to avoid them, and how to monitor your robots.txt file.

There are many reasons why website operators may want to exclude certain parts of a website from the search engine index, for example when pages are hidden behind a login, are archived, or need to be tested before publication. "A Standard for Robot Exclusion" was released in 1994 to make this possible. The protocol establishes that, before starting to crawl, a search engine crawler should first look for the robots.txt file in the root directory of the site and read the instructions it contains.

Many errors can occur when creating the robots.txt file, such as syntax errors when a directive is not written correctly, or errors resulting from unintentionally blocking a directory.

Here are some of the most common robots.txt errors:

Mistake no. 1: using incorrect syntax

robots.txt is a simple text file and can easily be created with a text editor. An entry in the robots.txt file always consists of two parts: the first part specifies the user agent to which the instruction applies (e.g. Googlebot), and the second part contains a directive such as "Disallow" followed by the paths that should not be crawled. For the instructions in the robots.txt file to take effect, the correct syntax must be used, as shown below.

 

User-agent: Googlebot
Disallow: /example_directory/

 

In the example above, the Google crawler is prohibited from crawling the /example_directory/ directory. If you want this rule to apply to all crawlers, use the following code in your robots.txt file:

 

User-agent: *
Disallow: /example_directory/

 

The asterisk (also known as a wildcard) acts as a placeholder for all crawlers. Similarly, you can use a single forward slash (/) to block the entire website from being crawled (for example, for a trial version before it goes into production).

 

User-agent: *
Disallow: /

 

Mistake no. 2: blocking path components instead of a directory (forgetting the trailing "/")

When excluding a directory from crawling, always remember to add the slash at the end of the directory name. For instance,

Disallow: /directory not only blocks /directory/, but also /directory-one.html, which you may not have intended to exclude.
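
To see the difference in practice, here is a minimal sketch (not from the original article) that uses Python's standard-library robots.txt parser, urllib.robotparser, to check which URLs a rule blocks with and without the trailing slash. The domain and paths are the illustrative ones used in this article.

from urllib.robotparser import RobotFileParser

def is_allowed(rules: str, url: str) -> bool:
    # Parse an in-memory robots.txt and ask whether any crawler ("*")
    # may fetch the given URL.
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch("*", url)

# Without the trailing slash: the directory and the similarly named file are both blocked.
no_slash = "User-agent: *\nDisallow: /directory"
print(is_allowed(no_slash, "http://www.your-website.com/directory/"))          # False
print(is_allowed(no_slash, "http://www.your-website.com/directory-one.html"))  # False

# With the trailing slash: only the directory itself is blocked.
with_slash = "User-agent: *\nDisallow: /directory/"
print(is_allowed(with_slash, "http://www.your-website.com/directory/"))          # False
print(is_allowed(with_slash, "http://www.your-website.com/directory-one.html"))  # True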

If you want to exclude multiple directories or pages from crawling, add each one on its own Disallow line. Listing multiple paths on the same line usually leads to unwanted errors.

 

User-agent: Googlebot
Disallow: /example-directory/
Disallow: /example-directory-2/
Disallow: /example-file.html


Mistake no. 3: unintentionally blocking directories

Before the robots.txt file is uploaded to the website's root directory, you should always check that its syntax is correct. Even the smallest mistake can cause the crawler to ignore the instructions in the file and crawl pages that should not be indexed. Always make sure that directories that are not to be crawled are listed after the Disallow: directive.

Even when the page structure of your website changes, for example after a redesign, you should check the robots.txt file again for errors.
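
As a practical pre-upload check, here is a minimal sketch (an illustration, not part of the original article) that parses a local draft of the robots.txt file with Python's urllib.robotparser and reports which of a few important URLs would be blocked for Googlebot. The file name and URLs are placeholders for your own site.

from pathlib import Path
from urllib.robotparser import RobotFileParser

# Read the draft robots.txt before uploading it to the root directory.
draft = Path("robots.txt").read_text(encoding="utf-8")

parser = RobotFileParser()
parser.parse(draft.splitlines())

# URLs that must remain crawlable (placeholders for your own pages).
important_urls = [
    "http://www.your-website.com/",
    "http://www.your-website.com/products/",
    "http://www.your-website.com/blog/latest-post.html",
]

for url in important_urls:
    status = "ALLOWED" if parser.can_fetch("Googlebot", url) else "BLOCKED"
    print(f"{status:8} {url}")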

Mistake no. 4: the robots.txt file is not saved in the root directory

The most common error associated with the robots.txt file is failing to save the file in the website's root directory. Subdirectories are generally ignored, because user agents only look for the robots.txt file in the root directory.

The correct URL for a website's robots.txt file must have the following format:

 

http://www.your-website.com/robots.txt
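
To verify that the file is actually reachable at the root, here is a minimal sketch (assuming the placeholder domain from the example above) that fetches /robots.txt and prints the HTTP status and the beginning of the file.

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

url = "http://www.your-website.com/robots.txt"
try:
    with urlopen(url, timeout=10) as response:
        print(f"{url} -> HTTP {response.status}")
        # Show the first few hundred characters of the file as a sanity check.
        print(response.read().decode("utf-8", errors="replace")[:500])
except (HTTPError, URLError) as error:
    print(f"robots.txt not reachable at {url}: {error}")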

 

Mistake no. 5: disallowing pages that redirect

If pages blocked in your robots.txt file redirect to other pages, the crawler may not recognize the redirects. In the worst case, this can cause a page to keep appearing in search results under an incorrect URL. In addition, the Google Analytics data for your project may also become inaccurate.

Hint: robots.txt versus noindex

It is important to note that excluding pages in the robots.txt file does not necessarily mean that they will not be indexed: a URL blocked in robots.txt can still end up in the index if, for example, it is linked from an external page. The robots.txt file only gives you control over crawling by the user agent. However, because the bot is prohibited from crawling the page, the following text often appears in place of the meta description:

"A description for this result is not available due to this site's robots.txt file."

Figure 4: Example of a search result snippet for a page blocked by the robots.txt file but still indexed

As you can see, a single link to the page is enough for it to be indexed, even if the URL is blocked with "Disallow" in the robots.txt file. Likewise, the noindex meta tag (<meta name="robots" content="noindex">) cannot prevent indexing in this case, because the crawler is never able to read that part of the code due to the Disallow rule in the robots.txt file.

To prevent certain URLs from appearing in the Google index, you should use the noindex meta tag instead, while still allowing the crawler to access the directory.

Conclusions

We have quickly examined the main robots.txt errors, which in some cases can significantly compromise the visibility and ranking of your website, in the most serious cases leading to its complete disappearance from the SERP.

If you think you will never run into such problems because you know how robots.txt works and would never make careless changes, keep in mind that errors in the robots.txt file are sometimes the result of oversights in the configuration of a CMS such as WordPress, or even of malware attacks or sabotage aimed at making your site lose indexing and ranking.

The best advice we can give is to monitor the robots.txt file constantly, at least on a weekly basis, and to check its syntax and behaviour whenever you notice warning signs such as a sudden drop in traffic or pages disappearing from the search engine results pages (SERP).
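
As a starting point for this kind of monitoring, here is a minimal sketch (an illustration under the assumption that you keep a known-good copy of the file locally; the URL and file name are placeholders) that fetches the live robots.txt and warns when it differs from that copy.

import hashlib
from pathlib import Path
from urllib.request import urlopen

ROBOTS_URL = "http://www.your-website.com/robots.txt"
KNOWN_GOOD = Path("robots_known_good.txt")

def fingerprint(data: bytes) -> str:
    # Hash the file contents so comparisons are cheap and exact.
    return hashlib.sha256(data).hexdigest()

with urlopen(ROBOTS_URL, timeout=10) as response:
    live = response.read()

if not KNOWN_GOOD.exists():
    # First run: store the current file as the reference copy.
    KNOWN_GOOD.write_bytes(live)
    print("Stored the current robots.txt as the known-good copy.")
elif fingerprint(live) != fingerprint(KNOWN_GOOD.read_bytes()):
    print("WARNING: the live robots.txt differs from the known-good copy!")
else:
    print("robots.txt is unchanged.")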
