July 12 2022

Crawl budget and crawl budget optimization

How do you optimize your website's crawl budget for SEO?


The crawl budget (or crawling budget) is an often misunderstood term in SEO and digital marketing, largely because of a limited understanding of how search engines work.

There are currently about a trillion pages on the World Wide Web. Organizing those pages and evaluating their relative value is one of the most challenging tasks that search engines face.

It's a problem for website owners when Googlebot doesn't crawl all the pages on their website. When this happens, it is usually due to one of two reasons:

  1. Google has limited resources and therefore has developed mechanisms to filter out low-quality pages and spam.
  2. Google often limits the number of pages it will crawl so your server doesn't crash.

So if Google spends its resources trying to crawl every page on your website, including low-quality ones, your most valuable pages may not be crawled. This is why it is necessary to optimize your crawl budget.

In this article, we'll cover the fundamentals of crawl budget optimization and address the common problems website owners encounter when making their pages more crawlable.

What is a crawl budget?

A crawl budget is the predetermined number of requests a crawler will make on a website within a given period.

This number determines how many, and which, pages Googlebot will crawl on your website.

The crawl budget is determined entirely by the search engine. Once the budget is exhausted, the web crawler automatically stops accessing your site's content and moves on to the next website.

Search engines allocate crawl budgets to websites because they can only crawl a limited number of web pages. To cover the millions of websites on the Internet, Google divides its resources among them as fairly as possible. Each website's crawl budget is different and depends on several factors:

  1. Website size:  larger websites are usually assigned larger crawl budgets.
  2. Server configuration and site performance:  server response times and overall site performance are also taken into account when allocating your crawl budget.
  3. Links on your site:  internal link structures play a vital role, and dead links or redirect chains can drain your crawl budget.
  4. Frequency of content updates on your site:  Google spends more time crawling sites with regular content updates.

The importance of the Crawl Budget for SEO

Managing your crawl budget isn't crucial for relatively small websites with just a few pages, but it becomes important for sites of moderate or larger size.

SEO involves making many small but collectively significant changes that affect the growth of your website over time, rather than making large changes to get quick results. Your job as an SEO professional or web administrator is to optimize thousands of little things as much as possible.

Search engines have limited resources and cannot crawl and index every web page they find on a huge and ever-changing Internet. This is why crawl budget becomes so important, especially for larger websites with many pages.

While crawl budget is less of a concern for webmasters with smaller sites, even a website that looks small at first glance can contain thousands of URLs. Faceted navigation by attributes and taxonomies, common in many online stores and e-commerce websites, can easily turn 100 pages into 10,000 unique URLs, which can become a problem for crawling and indexing. Bugs in the CMS can likewise produce undesirable results.
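To get a feel for the scale of the problem, here is a quick back-of-the-envelope calculation in Python. The facet names and values are made up purely for illustration: three filters with a handful of options each are enough to turn 100 listing pages into 10,000 parameterised URLs.

```python
from math import prod

# Hypothetical filter facets for a small catalog; real sites often have more.
facets = {
    "color": ["red", "blue", "green", "black", "(any)"],
    "size":  ["s", "m", "l", "xl", "(any)"],
    "sort":  ["price_asc", "price_desc", "newest", "(default)"],
}

base_pages = 100  # category/listing pages before any filtering

# Every combination of facet values can produce its own parameterised URL.
combinations = prod(len(values) for values in facets.values())  # 5 * 5 * 4 = 100

print(f"{combinations} filter combinations per listing page")
print(f"Up to {base_pages * combinations:,} crawlable URLs")    # up to 10,000
```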

As an SEO best practice, it is generally recommended that all webmasters evaluate their website's crawl budget, regardless of its size or structure.

Understanding the crawling process

Understanding what a crawl budget is and why it matters is one thing, but website owners and SEO teams also need to understand how Google crawls websites.

How search engines work 

Search engines like Google use three basic processes to catalog web pages: crawling, indexing, and ranking.

Crawling: finding information

Search engine crawlers begin by visiting websites from a list of web addresses obtained from previous crawls and from sitemaps provided by webmasters through tools like Google Search Console. The crawlers then use the links on those sites to discover other pages.

Indexing: organizing information

Next, search engines organize the pages they have visited by indexing them. The web is essentially a giant library that grows every minute with no central filing system. Search engines examine the content of each page and look for key signals that indicate what it is about (e.g. keywords), and they use this information to index the page.

Ranking: serving information

Once a web page has been crawled and indexed, search engines answer user queries by ranking the indexed pages with their ranking algorithms.

The details of crawling

Gary Illyes, Google's Webmaster Trends Analyst, gave us a clearer picture of the Googlebot crawling process in a 2017 blog post. According to him, the crawl budget is mainly based on two components: the crawl rate limit and crawl demand.

Crawl rate limit

A crawl rate limit refers to how often your website is crawled.

Crawling consumes server resources and the bandwidth allotted to the site by its host. This is why search engines like Google have systems to determine how often to visit a website, so that it can be crawled sustainably.

This means that there is a limit to the number of times a given website will be crawled. The crawl rate limit prevents crawlers from disrupting your website's performance by overloading it with HTTP requests. This allows search engines to determine how often they can visit your website without causing performance issues.

This process also has drawbacks. Manually setting the crawl rate limit can cause problems such as:

  • Too low a crawl rate - new content on your website remains unindexed for long periods.
  • Too high a crawl rate - your monthly crawl budget is unnecessarily exhausted by repeated crawling of content that should not be crawled.

This is why web administrators are generally advised to leave crawl rate optimization to the search engines.

Crawl demand

Crawl demand determines the number of pages on a website that a crawler will visit during a single crawl. It is mainly influenced by the following factors:

  • URL popularity - the more traffic a page gets, the more likely it is to be crawled and indexed.
  • Staleness - pages with regular content updates are treated as fresh URLs and are more likely to be crawled than pages with infrequently updated content, or "stale URLs".

Factors Affecting Your Crawl Budget

Many factors determine the crawl budget, and several of them cause recurring problems for website owners. Understanding them gives us a checklist to review whenever a site starts to suffer drops in indexing and rankings, which can be symptoms of more or less serious errors that can almost always be resolved.

Faceted navigation by attributes and taxonomies

Ecommerce websites often have dozens of variations of the same product and must provide a way for users to filter and sort them. They do this through faceted navigation, creating unique filtered and sorted URLs for each product view.

While faceted navigation is very useful for users, it can create a number of problems for search engines. Filters applied often create dynamic URLs, which appear to web crawlers as individual URLs that need to be crawled and indexed. This can unnecessarily drain your crawl budget and create duplicate content on your website.

Session identifiers and duplicate content 

URL parameters such as session IDs or tracking IDs end up creating several unique instances of the same URL. This can also create duplicate content issues that damage your website's rankings and drain your crawl budget.
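One common mitigation is to normalize URLs before they are linked or submitted, stripping the parameters that only track sessions or campaigns. The sketch below shows the idea in Python with the standard library; the parameter names are hypothetical and should be replaced with whatever your own platform actually appends.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameters that only track sessions/campaigns and do not change
# the page content; adjust the set to what your own site really uses.
TRACKING_PARAMS = {"sessionid", "sid", "utm_source", "utm_medium", "utm_campaign", "fbclid"}

def canonicalize(url: str) -> str:
    """Return the URL with session/tracking parameters removed."""
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in TRACKING_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(canonicalize("https://example.com/product?id=42&sessionid=abc123&utm_source=mail"))
# -> https://example.com/product?id=42
```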

Soft 404 pages


A soft 404 occurs when a missing or broken web page responds with an HTTP 200 OK status code instead of a 404 Not Found response code. This causes the crawler to keep crawling that broken page and consume your crawl budget. It is a fairly crude error, but a common one.
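A quick way to check whether a site serves soft 404s is to request a path that cannot possibly exist and look at the status code it returns. The Python sketch below applies that heuristic; it uses the third-party requests library, and example.com is just a placeholder domain.

```python
import uuid
import requests  # third-party: pip install requests

def serves_soft_404(base_url: str) -> bool:
    """Request a path that cannot exist; a healthy site should answer 404, not 200."""
    probe = f"{base_url.rstrip('/')}/{uuid.uuid4().hex}-does-not-exist"
    response = requests.get(probe, timeout=10, allow_redirects=True)
    return response.status_code == 200  # 200 for a missing page = soft 404

if __name__ == "__main__":
    print(serves_soft_404("https://example.com"))  # True means soft 404s are served
```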

Poor server and hosting configuration

A poor server and hosting setup can cause your website to crash frequently. The crawl rate limit prevents crawlers from hammering crash-prone websites, so they will often avoid sites hosted on poorly configured servers.

As a provider of dedicated server hosting, WordPress hosting and WooCommerce hosting for sites with millions of pages, we can say that very often the server-side software stack is simply not capable of returning content quickly, resulting in a Time To First Byte that is far too high.

It is through lean, high-performance application development, flanked by enterprise-grade caching systems such as Varnish Cache, that these limits and critical issues can be overcome.

Render-blocking CSS and JavaScript

Every resource that a web crawler fetches when rendering your web page is accounted for in your crawl budget, including not only HTML content but also CSS and JS files.

Webmasters need to make sure that all these resources can be cached by the search engine, minimize performance issues, and ensure that external style sheets don't cause problems such as excessive code splitting.

Broken links and redirects 

A broken link is a hyperlink that sends the user or bot to a page that doesn't exist. Broken links can be caused by an incorrect URL in the link or by a page that has been removed. When 301 redirects chain to one another in sequence, they can frustrate human users and confuse search engine robots.

Whenever a bot encounters a redirected URL, it must send an additional request to reach the final destination URL. The bigger a website is, the more serious this problem becomes: a website with 500 redirects gives a crawler at least 1,000 URLs to request. A redirected link can also send a crawler through a redirect chain, burning crawl budget on unnecessary redirect hops.

Site speed and Hreflang tags


Your website needs to load fast enough for the web crawler to access your pages efficiently. These crawlers often switch to a completely different website when they encounter a page that loads too slowly; for example, if it has a server response time greater than two seconds.
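If you want a rough, do-it-yourself measurement of server response time before reaching for full profiling tools, a few timed requests are enough for a first impression. The sketch below uses the requests library's elapsed attribute, which measures time until the response headers arrive; it is only an approximation of TTFB, not a performance audit, and example.com is a placeholder.

```python
import requests  # third-party: pip install requests

def average_response_time(url: str, samples: int = 3) -> float:
    """Average time in seconds the server takes to answer a GET request."""
    timings = [requests.get(url, timeout=10).elapsed.total_seconds() for _ in range(samples)]
    return sum(timings) / len(timings)

if __name__ == "__main__":
    avg = average_response_time("https://example.com/")
    warning = "(risk of reduced crawling)" if avg > 2 else ""
    print(f"average response time: {avg:.2f}s {warning}")
```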

We have written a great deal about website speed in various posts on our blog; "Performance Managed Hosting" is, after all, the claim that distinguishes us. Although there are many approaches to speeding up a site, a lot depends on the CMS in use when applying the classic optimization techniques on the hardware side, the server side and, above all, the application side.

Alternate URLs defined with the hreflang tag can also exhaust your crawl budget.

The XML sitemap, or site map

Search engines like Google generally prioritize crawl scheduling for URLs included in your sitemap over those that Googlebot discovers on its own while crawling your site. This means that creating and submitting your website's XML sitemap in Google Search Console is vital for its SEO health. However, adding every page to the sitemap can also be harmful: if the crawler has to treat all of your content as a priority, it consumes your crawl budget.

How to calculate the crawl budget?

Tracking and calculating your crawl budget is tricky, but it can give you some very valuable information about your website.

First, you need to know how many pages you have. You can get that number from your XML sitemap, by querying Google with a site:yourdomain.com search, or by crawling your website with a tool like Screaming Frog. Once you know how many web pages you have, open Google Search Console for your website and find the Crawl Stats report in the Legacy tools and reports section.

This shows Googlebot activity on your site over the past 90 days. Here you can find the average number of pages crawled per day. Assuming the number remains consistent, you can calculate the crawl budget with the following formula:

average pages crawled per day × 30 days = monthly crawl budget

This information is very useful when you need to optimize your crawl budget. Divide the number of pages on your website by the average number of pages crawled per day.

If the result is greater than 10, you have more than 10 times as many pages on your site as Google crawls per day, which means you need to optimize your crawl budget. If the number is less than 3, your crawl budget is already in good shape.
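The rule of thumb above is easy to wrap in a few lines of Python. The numbers in the example call are invented purely for illustration.

```python
def crawl_budget_report(total_pages: int, avg_pages_crawled_per_day: float) -> None:
    """Apply the rule of thumb from the article: pages / daily crawl rate."""
    monthly_budget = avg_pages_crawled_per_day * 30
    ratio = total_pages / avg_pages_crawled_per_day
    print(f"Estimated monthly crawl budget: {monthly_budget:,.0f} pages")
    print(f"Pages per page-crawled-per-day: {ratio:.1f}")
    if ratio > 10:
        print("-> crawl budget likely needs optimization")
    elif ratio < 3:
        print("-> crawl budget looks healthy")
    else:
        print("-> borderline; keep monitoring")

# Example with made-up numbers: 12,000 URLs, ~600 pages crawled per day.
crawl_budget_report(12_000, 600)   # ratio 20 -> needs optimization
```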

Crawl budget optimization

Optimizing the crawl budget for your website simply means taking the proper steps to increase it. By improving some key factors that affect it, such as faceted navigation, outdated content, 404 errors, and 301 redirect chains, you can be well on your way to increasing your website's crawl budget. Here's how:

Optimize faceted navigation

Faceted navigation can eat up your crawl budget if not implemented correctly, but that shouldn't stop you from using it. You just need a few tweaks to optimize it:

  • You can add a "noindex" tag that tells bots not to index those pages. This removes the pages from the index, but crawl budget is still spent crawling them.
  • Adding a "nofollow" attribute to faceted navigation links discourages the crawler from following them, freeing up crawl budget for more important URLs (one way to apply both directives is sketched after this list).
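One way to apply these directives without editing every template is to set the X-Robots-Tag response header for any URL that carries filter parameters. The following is a minimal sketch assuming a Flask application and a hypothetical set of facet parameter names; adapt both to your own stack.

```python
from flask import Flask, request

app = Flask(__name__)

# Hypothetical query parameters that identify faceted/filtered views on this site.
FACET_PARAMS = {"color", "size", "sort", "price_min", "price_max"}

@app.after_request
def mark_faceted_pages(response):
    """Add noindex/nofollow to any response generated by a faceted-navigation URL."""
    if FACET_PARAMS & set(request.args):
        response.headers["X-Robots-Tag"] = "noindex, nofollow"
    return response

@app.route("/category/<name>")
def category(name):
    # Placeholder view: the header above is applied to filtered variants of this page.
    return f"Listing for {name}"
```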

Remove outdated content 

Removing outdated content will free up a lot of your crawl budget. You don't need to physically delete the pages that contain it; you just need to keep crawlers away from them, as you did with faceted navigation links.

This would reduce the number of crawlable URLs in your index and increase your crawl budget.

Reduce 404 error codes 

To reduce the number of 404 errors on your website, you need to clean up your broken links and make sure missing pages return a genuine 404 Not Found response code to the web crawler. This helps crawlers stop requesting those URLs and, once again, frees up crawl budget by reducing the number of crawlable URLs on your site.
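The key point is that a custom "page not found" template must still return a real 404 status code, otherwise you create exactly the soft 404 problem described earlier. As a minimal illustration, assuming a Flask application (your CMS or framework will have its own equivalent):

```python
from flask import Flask

app = Flask(__name__)

@app.errorhandler(404)
def not_found(error):
    # Return a friendly page *and* a genuine 404 status code, so crawlers
    # drop the URL instead of treating it as a live (soft 404) page.
    return "<h1>Page not found</h1>", 404
```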

Fix 301 redirect chains 

Broken links and 301 redirect chains can also unnecessarily drain your crawl budget, and cleaning them up should be part of your regular website maintenance. To avoid this problem and increase your crawl budget, you need to improve your internal links and fix any lingering redirect chains:

  1. Run a full crawl of your website using a tool like Screaming Frog.
  2. After the crawl is complete, identify the redirected URLs and the source page where each link is located.
  3. Finally, update these links so that all of them point directly to the destination URLs (a small script for tracing redirect chains is sketched below).
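For a quick check outside of a dedicated crawler, a redirect chain can also be traced with a few lines of Python by disabling automatic redirect following. This is a simplified sketch using the requests library; the URL is a placeholder.

```python
import requests  # third-party: pip install requests
from urllib.parse import urljoin

def trace_redirects(url: str, max_hops: int = 10) -> list:
    """Follow a URL hop by hop and return the full redirect chain."""
    chain = [url]
    for _ in range(max_hops):
        response = requests.get(chain[-1], allow_redirects=False, timeout=10)
        if response.status_code in (301, 302, 303, 307, 308) and "Location" in response.headers:
            chain.append(urljoin(chain[-1], response.headers["Location"]))
        else:
            break
    return chain

if __name__ == "__main__":
    chain = trace_redirects("http://example.com/old-page")
    if len(chain) > 2:
        print("Redirect chain detected; link directly to:", chain[-1])
    else:
        print(" -> ".join(chain))
```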

You should also avoid orphan pages, which appear in the sitemap but are not linked internally, leaving them stranded within your website architecture.

Clean and update your sitemap

Check your sitemap at regular intervals for non-indexable URLs that were included and for indexable URLs that were incorrectly excluded from it.
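A simple script can automate part of this check. The sketch below parses a sitemap with the standard library, then flags URLs that do not answer 200 or that carry a noindex directive in the X-Robots-Tag header. It does not inspect meta robots tags in the HTML, so treat it as a first pass; the sitemap URL is a placeholder.

```python
import xml.etree.ElementTree as ET
import requests  # third-party: pip install requests

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def audit_sitemap(sitemap_url: str) -> None:
    """Flag sitemap URLs that are redirected, broken, or marked noindex via header."""
    root = ET.fromstring(requests.get(sitemap_url, timeout=10).content)
    for loc in root.findall("sm:url/sm:loc", NS):
        url = loc.text.strip()
        response = requests.get(url, timeout=10, allow_redirects=False)
        noindex = "noindex" in response.headers.get("X-Robots-Tag", "").lower()
        if response.status_code != 200 or noindex:
            print(f"{response.status_code}  noindex={noindex}  {url}")

audit_sitemap("https://example.com/sitemap.xml")
```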

Improve site speed and the hreflang tag 

Improving the speed of your website not only provides a better user experience, it also increases its crawl rate. Sites with slow loading speeds are often avoided altogether by Googlebot. Optimizing page speed involves many technical SEO factors, but addressing them helps your crawl budget.

For example, you can see how it is possible to increase hosting speed by using our Core Web Vitals Optimization service.

Using the hreflang tag (<link rel="alternate" hreflang="..." />) in the page header allows you to point the crawler to localized versions of your pages and avoids wasting crawl budget.
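Hreflang annotations are just link elements in the page head, one per language or region plus an x-default. The small helper below generates that block from a dictionary of invented example URLs.

```python
# Hypothetical mapping of language/region codes to localized URLs for one page.
alternates = {
    "en": "https://example.com/en/product",
    "it": "https://example.com/it/prodotto",
    "de": "https://example.com/de/produkt",
    "x-default": "https://example.com/product",
}

def hreflang_links(urls: dict) -> str:
    """Build the <link rel="alternate" hreflang="..."> block for the <head>."""
    return "\n".join(
        f'<link rel="alternate" hreflang="{lang}" href="{href}" />'
        for lang, href in urls.items()
    )

print(hreflang_links(alternates))
```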

Use HTML where possible

While Googlebot has become more efficient at crawling JavaScript files along with indexing Flash and XML, this isn't the case with other popular search engines like Bing or DuckDuckGo. This is why it is always advisable to use HTML wherever possible, as all search engine bots can easily crawl HTML files.

Use robots.txt to steer crawling toward important pages

Using your website's robots.txt file is a very efficient way to optimize your crawl budget. You can manage robots.txt to allow or block crawling of any path on your domain. Doing this with a website auditing tool is recommended for larger websites, where frequent adjustments are required.
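Before shipping changes to robots.txt, it is worth verifying that the rules really block the thin, filtered URLs while leaving the important pages crawlable. Python's standard-library robot parser can run that check against the live file; the domain and URLs below are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt for the (placeholder) domain.
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

# Hypothetical URLs: filtered views should be blocked, plain categories allowed.
for url in (
    "https://example.com/category/shoes?color=red&sort=price_asc",
    "https://example.com/category/shoes",
):
    allowed = robots.can_fetch("Googlebot", url)
    print(f"{'ALLOWED' if allowed else 'BLOCKED'}  {url}")
```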

Use instant indexing services like IndexNow

IndexNow is an easy way for website owners to instantly inform search engines about the latest changes to content on their website. In its simplest form, IndexNow is a simple ping so that search engines know that a URL and its content have been added, updated, or deleted, allowing search engines to quickly reflect this change in their search results.

Without IndexNow, it could take days or weeks for search engines to discover that content has changed, since search engines don't crawl every URL often. With IndexNow, search engines know instantly which URLs have changed, helping them prioritize crawling of those URLs and thus limiting the organic crawling needed to discover new content.

IndexNow is particularly useful if your site needs to rank very quickly, and especially if your site is a blog, an online newspaper or another editorial publication.
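Submitting URLs is a single HTTP request. The sketch below follows the JSON submission format documented at indexnow.org at the time of writing, using the requests library; the host, key and URLs are placeholders, and the key must also be published as a text file at the keyLocation address for the submission to be accepted.

```python
import requests  # third-party: pip install requests

# Placeholder values: generate your own key and host it at keyLocation
# as required by the IndexNow protocol.
payload = {
    "host": "example.com",
    "key": "your-indexnow-key",
    "keyLocation": "https://example.com/your-indexnow-key.txt",
    "urlList": [
        "https://example.com/blog/new-post",
        "https://example.com/blog/updated-post",
    ],
}

response = requests.post(
    "https://api.indexnow.org/indexnow",
    json=payload,
    headers={"Content-Type": "application/json; charset=utf-8"},
    timeout=10,
)
print(response.status_code)  # 200/202 indicates the submission was accepted
```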

Use our hosting services to optimize your crawl budget.

Crawl budget optimization is an inexact science that involves a lot of moving parts and ongoing website maintenance tasks that can be very cumbersome.

Using Managed Server Optimized Hosting allows Google to easily crawl and index your website, regardless of whether it is built with HTML or JavaScript and regardless of how many web pages it has. When you automate your website's crawl budget optimization process, you free up mental bandwidth for more important tasks focused on higher-level SEO strategy.



