October 16, 2023

Use and abuse of Crawl Delay

Importance and implications of Crawl Delay: a parameter that can protect your server but also compromise the visibility of your site in search results.

The world of Search Engine Optimization (SEO) is vast and constantly evolving. One of its most technical and often overlooked aspects is managing search engine crawling. In this post, we will address a specific topic: the use and abuse of Crawl Delay, a directive that can be added to the robots.txt file to control how often search engine crawlers access your website.

What is a Crawler?

A crawler, sometimes called a spider or bot, is automated software used by search engines such as Google, Bing, Yahoo and others to navigate the maze of the World Wide Web. Its main purpose is to explore and analyze websites in order to index them and therefore make them searchable via search engines. But how exactly does a crawler work and why is it so critical?

A crawler begins its work from a set of known URLs, called "seeds". Starting from these initial URLs, the crawler examines the content of each page, reads its HTML code and identifies all the links it contains. The newly found URLs are added to a queue for later analysis. This process repeats recursively, allowing the crawler to discover more and more pages and add them to the search engine's index.
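
To make this fetch-parse-queue cycle concrete, here is a minimal sketch in Python. The choice of urllib and html.parser is ours, purely for illustration, and is not meant to mirror how any real search engine is implemented.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href values of all <a> tags found in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=50):
    queue = deque(seeds)        # URLs waiting to be visited
    seen = set(seeds)           # avoid queuing the same URL twice
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue            # skip pages that cannot be fetched
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)   # newly discovered URLs join the queue
    return seen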

In addition to extracting links, crawlers are able to analyze other elements of web pages, such as meta tags, titles, images and even multimedia, to gain a more complete understanding of the site. This data is then used to determine the relevance of a page to a particular search query, thus influencing its ranking in search results.

The action of crawlers is fundamental for the creation and updating of search engine indexes. Without crawling, it would be virtually impossible for search engines to provide up-to-date and relevant results. Web pages, blogs, forums and all other forms of online content depend on crawlers to be “discovered” and then made accessible to Internet users through searches.

Risks of Excessive Crawling

The crawling process is undoubtedly crucial to ensuring that a website is visible and easily accessible through search engines. However, a high volume of crawling requests can pose a serious problem, straining the server's capabilities, especially if the server is not optimized or adequately sized to handle heavy traffic.

Sizing and Performance

A poorly sized server, with limited hardware resources such as CPU, memory, and bandwidth, is particularly vulnerable to overload caused by intensive crawling. This is even more true if the web application hosted on the server has not been optimized for performance.

Slow Queries and Resource-Intensive Pages

Factors such as poorly designed or overly complex database queries, or excessive use of resources to dynamically generate a web page, can further aggravate the situation. In an environment like this, a crawler sending a large number of requests in a very short amount of time can exacerbate bottlenecks, drastically slowing server performance. This can lead to longer loading times for end users and, in the worst case, make the website completely inaccessible.

Error 500 and Its Importance

A typical symptom of an overloaded server is HTTP error 500, a status code that indicates a generic error and is often a sign of internal server problems. The 500 error can serve as a warning sign, not only for site administrators but also for search engines. Google, for example, is able to modulate its crawling frequency in response to an increase in 500 errors. When Google's crawler detects a large number of these errors, it may decide to reduce the speed of its requests to minimize the impact on the server.

In this way, the 500 error takes on a dual importance: on the one hand, it tells website administrators that something is wrong with the system; on the other hand, it signals to search engines that they may need to reduce their crawl rate to avoid causing further problems.
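
To illustrate the back-off idea in its simplest form, here is a hypothetical sketch of a client that widens its pause whenever the server starts answering with 5xx errors. It is only an illustration of the principle, not a reproduction of Google's actual algorithm.

import time
import urllib.error
import urllib.request

def polite_fetch(urls, base_pause=1.0, max_pause=60.0):
    """Fetch URLs in sequence, doubling the pause after each 5xx response."""
    pause = base_pause
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                response.read()
            pause = base_pause                     # healthy reply: back to normal pace
        except urllib.error.HTTPError as err:
            if err.code >= 500:
                pause = min(pause * 2, max_pause)  # server struggling: slow down
        time.sleep(pause)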

Crawl Delay: A Solution?

Crawl Delay is a directive that can be added to a site's robots.txt file. It tells crawlers to pause for a given number of seconds between one request and the next. For example, with a Crawl Delay of 10 seconds, the crawler is asked to wait 10 seconds between consecutive requests:

User-agent: *
Crawl-delay: 10
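
On the other side of the exchange, a well-behaved client can read this value before fetching any pages. The sketch below uses Python's standard urllib.robotparser module; the example.com addresses are placeholders. Note that not every crawler honors the directive: Google's crawler, for instance, ignores Crawl-delay and relies on its own adaptive rate control instead.

import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()                                   # download and parse robots.txt

delay = rp.crawl_delay("*") or 0            # None if no Crawl-delay is declared
for url in ("https://example.com/page-1", "https://example.com/page-2"):
    if rp.can_fetch("*", url):
        print("fetching", url)
        time.sleep(delay)                   # respect the requested pause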

When Crawl Delay Becomes an Obstruction

While adding a Crawl Delay to a website's robots.txt file may seem like an effective strategy to mitigate the risk of server overload from excessive crawling, this solution also has significant drawbacks. Introducing a pause between requests limits the number of requests a crawler can make in a given period of time, which translates directly into slower indexing of new pages and of changes to existing ones. In a context where the speed at which content is indexed influences its visibility and, consequently, traffic and conversions, a Crawl Delay that is too high can be counterproductive.
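
To put numbers on it: with a Crawl Delay of 10 seconds, a single crawler can make at most 86,400 / 10 = 8,640 requests per day; at 30 seconds the ceiling drops to 2,880. For a site with, say, 50,000 URLs, one complete pass would then take roughly six days at the 10-second setting, and well over two weeks at 30 seconds.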

For example, imagine you have just published a time-sensitive news article or an important update about a product or service. In such a situation, you would want this information indexed as quickly as possible to maximize visibility and engagement. A Crawl Delay set too high could significantly delay this process, making your information less competitive or even irrelevant.

Google, one of the most advanced search engines, has the ability to dynamically modulate crawl speed in response to various factors, including the stability of the server from which the pages come. If Google detects an increase in 500 error codes, a sign that the server may be unstable or overloaded, the search engine is programmed to automatically reduce the frequency of its crawling requests. This is an example of how an intelligent and adaptive approach to crawling can be more beneficial than a rigid Crawl Delay setting, which does not take into account the variable dynamics that can affect a website's performance.

Crawl Delay Presets: A Bad Practice

Some hosting services, with a view to optimizing the performance and stability of the servers, set a default Crawl Delay value in the robots.txt file of the sites they host. For example, Siteground, a hosting provider known for specializing in performance-oriented WordPress solutions, applies this limitation as part of its standard configuration. While the intent may be to preserve server resources and ensure a smooth user experience, this practice is often not recommended unless there is a real and specific need to limit incoming connections by crawlers.

[Image: Crawl Delay set by default in a Siteground robots.txt file]

The reason is simple: every website has unique needs, dynamics and goals that cannot be effectively addressed by a “one size fits all” setup. Setting a default Crawl Delay can, in fact, hinder your site's ability to be indexed in a timely manner, potentially affecting your ranking in search results and, therefore, online visibility. In particular, for sites that update frequently or that require rapid indexing for topical or seasonal reasons, a generic limitation on crawling could be counterproductive.

Additionally, an inappropriate Crawl Delay can interfere with search engines' ability to dynamically evaluate and react to site and server conditions. As mentioned above, Google, for example, is able to modulate its crawling frequency in response to an increase in 500 errors or other signs of server instability. A rigidly set Crawl Delay could, therefore, make these adaptive mechanisms less effective.

So, while a host like Siteground may have the best intentions in wanting to preserve server performance through a default Crawl Delay, it is essential that website managers take into consideration the specific needs of their site and evaluate whether such a setting is really in their interest.

Impact on SEO

An inaccurate Crawl Delay setting can have serious consequences for a website's SEO. This parameter can slow down and limit the frequency with which search engine crawlers access and analyze your site. This reduction in crawl speed and frequency can cause delays in the indexing of new content, as well as updates of existing web pages in the search engine's database.

An often underestimated aspect is the effect of Crawl Delay on the so-called "crawl budget": the total number of pages that a search engine is willing to crawl on a specific site within a certain period of time. An excessive Crawl Delay prevents that budget from being fully used: the time window closes before all pages have been visited, leaving some of them unexplored and therefore unindexed. This is especially harmful for sites with a large volume of content that need regular and thorough crawling.

Furthermore, an incorrect Crawl Delay could cause crawlers to "abandon" the content retrieval phase, especially if they have difficulty accessing the information within the time available. This means that important updates or new content may not be picked up by search engines, compromising the site's visibility in SERPs (Search Engine Results Pages).

These delays and problems in crawling and indexing can lead to reduced visibility in search results. This reduced visibility often translates into a drop in incoming traffic and, ultimately, a worsening of SERP rankings. All of this can have a negative knock-on effect on the competitiveness of your website, hurting both traffic and conversions and, in the long term, the ROI (Return On Investment) of your online strategies.

Therefore, it is crucial to use Crawl Delay thoughtfully, taking into account both the needs of the server and the implications for SEO. Before making any changes to your robots.txt file, it is always advisable to consult an SEO expert for a complete assessment of your website's specific needs.

Conclusions

Managing the Crawl Delay is a delicate task that must balance the needs of the server with those of SEO. It is essential to consider carefully whether to introduce this directive and, if so, what value to set. An incorrect approach can have negative consequences for both server performance and SEO.

If your server is already optimized and the application performs well, setting a Crawl Delay may not be necessary. In any case, it is always a good idea to constantly monitor server performance and crawler activity through tools such as Google Search Console or your server logs, so you can make informed decisions.
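
As a concrete starting point for that monitoring, a script along the following lines can summarize how often the main crawlers hit the server and how many 500 responses they receive. The log path, log format and user-agent strings are assumptions to adapt to your own setup.

import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"          # hypothetical location
BOTS = ("Googlebot", "bingbot", "YandexBot")

status_by_bot = {bot: Counter() for bot in BOTS}
with open(LOG_PATH, encoding="utf-8", errors="ignore") as log:
    for line in log:
        match = re.search(r'" (\d{3}) ', line)   # HTTP status code field (rough match)
        if not match:
            continue
        for bot in BOTS:
            if bot in line:
                status_by_bot[bot][match.group(1)] += 1

for bot, statuses in status_by_bot.items():
    total = sum(statuses.values())
    errors = statuses.get("500", 0)
    print(f"{bot}: {total} requests, {errors} responses with status 500")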

Remember, Crawl Delay is just one piece in the complex mosaic of SEO and site performance. It should be used wisely and in combination with other best practices to ensure a strong and sustainable online presence.

