May 17, 2023

Google Crawl Stats and TTFB: A critically underappreciated relationship

Have you ever wondered how quickly Google Bots fetch content from your website?

Figure: Google crawl statistics

One of the key factors in your website's visibility is its interaction with Google's crawlers, the bots that scan your content so it can be indexed in search results. Two key elements of this interaction are Google's "Crawl Statistics" and the "Time to First Byte" (TTFB). Both of these factors can have a significant impact on how often Google visits your site and how quickly your content is indexed.

Google Crawl Statistics is a set of data that describes how Google's crawlers interact with your site. This data can include the number of crawl requests made, the time it took to download a page, the success or failure of the requests, and more. In general, your goal should be to have as low a crawl time as possible. A lower crawl time means that Google bots can crawl more pages in less time, which in turn can increase how often your site is indexed.

Time to First Byte (TTFB) is another key metric in the SEO world, one we have discussed extensively in this post. It is the time between when a client (such as a web browser or a Google crawler) makes an HTTP request and when it receives the first byte of data from the server. Again, a lower TTFB is generally better: it means data starts flowing sooner, which improves both the user experience and the efficiency of Google's crawling.

A common belief is that the Google Bots crawling your site come from European data centers; however, a closer look reveals a different reality. By examining your web server's access.log files, you can geolocate the crawlers' IP addresses and discover that many of them originate in the United States, in particular in Mountain View, California. This adds an extra latency of around 100 ms, as a simple ping of the IP shows.

For example, consider the Googlebot IP 66.249.64.239, retrieved from a recent log file:

66.249.64.239 - - [14/May/2023:03:37:25 +0200] "GET /wp-includes/js/jquery/jquery.min.js HTTP/2.0" 200 31017 "https://www.ilcorrieredellacitta.com/news" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/113.0.5672.63 Safari/537.36"
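If you want to verify this yourself, here is a minimal sketch in Python that extracts the client IPs of Googlebot requests from an access.log and validates them the way Google documents, with a reverse DNS lookup followed by a forward confirmation. The log path is hypothetical and the parsing assumes a standard combined log format with the client IP as the first field; adapt both to your setup.

import re
import socket

# Minimal sketch: LOG_PATH is a hypothetical example, and the regex assumes the
# client IP is the first field of each line (standard combined log format).
LOG_PATH = "/var/log/nginx/access.log"
IP_RE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s")

def is_real_googlebot(ip: str) -> bool:
    """Reverse DNS must point to googlebot.com/google.com, then forward-confirm."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False

googlebot_ips = set()
with open(LOG_PATH) as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        match = IP_RE.match(line)
        if match and is_real_googlebot(match.group(1)):
            googlebot_ips.add(match.group(1))

print(f"Verified Googlebot IPs found: {len(googlebot_ips)}")
for ip in sorted(googlebot_ips):
    print(ip)

Each verified IP can then be fed to a geolocation database or simply pinged, as we do further below.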

Most of you know that Google is headquartered in Mountain View and have heard the name thousands of times, yet few could say where Mountain View actually sits on the map.

Mountain View is a city in Santa Clara County, in the Silicon Valley region of California, United States. It sits on the west coast of the North American continent, overlooking the Pacific Ocean, which places it about as far from Europe as a network path can stretch: traffic coming from Europe must first cross the Atlantic Ocean and then the entire width of the North American continent, a journey of thousands of kilometers.

Figure: Distance between Mountain View and Europe

That distance inevitably translates into travel time. Even if we take optical fiber as the propagation medium, with a signal speed close to (but below) the speed of light, we must still add the processing time of every network device along the path, the routers and switches of each hop, as a traceroute toward that IP makes evident.


Let's now run an additional test and ping the IP 66.249.64.239 to measure the latency in the opposite direction, i.e. from the web server back to the Google bot.

Figure: Latency toward the Google Bot IP

We clearly see a latency of 104 ms; compared with an efficient European route, which pings in 20 ms at most, that is an "extra" of roughly 80 ms per round trip.
This latency may seem small, but it adds up quickly, especially if your site has many pages or is visited frequently by crawlers.
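ICMP ping requires raw sockets (and therefore root privileges) in Python, so a practical stand-in is to time the TCP handshake toward a host and port that accept connections. The sketch below does exactly that; the target is a placeholder to replace with whatever endpoint you want to measure (the Googlebot IP above will typically not accept inbound TCP, so for that specific case the plain ping shown in the screenshot remains the right tool).

import socket
import time

# Minimal sketch: TARGET is a placeholder; TCP handshake time approximates the
# network round-trip when ICMP is filtered or raw sockets are not available.
TARGET = ("www.example.com", 443)
SAMPLES = 5

rtts = []
for _ in range(SAMPLES):
    start = time.perf_counter()
    try:
        with socket.create_connection(TARGET, timeout=3):
            rtts.append((time.perf_counter() - start) * 1000)
    except OSError:
        pass  # connection refused, filtered or timed out: skip this sample
    time.sleep(0.5)

if rtts:
    print(f"TCP connect RTT: min {min(rtts):.1f} ms, "
          f"avg {sum(rtts) / len(rtts):.1f} ms over {len(rtts)} samples")
else:
    print("No samples collected; the target does not accept TCP connections.")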

A direct consequence of higher crawl times is a visible decrease in crawl requests from Google. This is due to a limitation known as the "crawl budget", which is the number of pages that Google can and will crawl in a given period of time. If your site takes too long to respond to crawler requests, Google may decide to limit the number of pages it crawls, which could in turn reduce your site's visibility in search results.

For example, let's look at the report of a former customer of ours who recently migrated his site to a well-known, supposedly high-performance hosting provider, and see what that meant in practical terms.

It is immediately and unmistakably evident that after the provider change the orange line of the average response time soared, going from a healthy 160 milliseconds to a full 663 milliseconds, more than four times slower. In other words, doing some quick back-of-the-envelope math, Google's bots now take over four times as long to retrieve content from the site as they did before.

In fact, if we compare total crawl requests before and after the migration, we see that they dropped from about 30 thousand to just 10 thousand, a loss of roughly two thirds, an extremely worrying figure, especially for editorial sites that publish a lot of content every day.

Repeating the analysis of Google's crawl statistics about 20 days later, we see that the average response time has gone from the optimal 160 ms to the current 1,550 ms, with even higher peaks, a 10- to 20-fold worsening of crawl speed.

Crawl requests have obviously deteriorated even further, falling from the original 30 thousand or so to just 5,000, effectively spelling the death of the site on search engines, Google News and Discover.

As for the TTFB that search engine crawlers experience, the SpeedVitals tool provides eloquent testimony, reporting the following metrics for the European test locations:

SpeedVitals TTFB values

 

Despite the importance of these metrics, many site owners and engineers aren't fully aware of their impact. Some may not know what Google crawl stats are, while others may not understand the relationship between TTFB and crawl rate. This lack of awareness can lead to suboptimal decisions about hosting, site structure and search engine optimization, and it can make it hard to properly evaluate a hosting provider like us, which to a layman or a self-styled expert may appear to simply "sell hosting like everyone else".

For example, a site owner might choose a hosting provider mainly on price, or on dazzling commercial promises and sensational slogans, without considering that beyond the marketing it is the facts that prove a provider's worth, and that slower or more distant servers increase TTFB and therefore reduce the effectiveness of Google's crawls. Similarly, a technician might focus on keyword selection or site design, overlooking the fact that a complex site structure or inefficient code increases crawl time and erodes the crawl budget, frustrating even the best intentions.

Google crawl stats and TTFB are two critical factors that can greatly affect your site's visibility in search results. To optimize your site's interaction with Google's crawlers, it's essential to understand these metrics and factor them into any decisions about your site. This can include choosing a suitable hosting provider like ours, optimizing your server-side software stack, and constantly monitoring your crawl stats and TTFB.

Figure: Crawl Stats report

CDN services as a solution or palliative to the problem.

At least in theory, one of the most effective strategies for improving crawl time is the implementation of a robust software stack, including a properly configured server-side caching system with as high a HIT ratio as possible.
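A quick way to keep an eye on that HIT ratio is to count cache-status tokens in the access log. The sketch below assumes your log format includes a cache status field (for example Nginx's $upstream_cache_status added to log_format); the path and the token matching are illustrative and must be adapted to your own configuration.

import collections

# Minimal sketch: assumes each log line carries a cache status token such as
# HIT / MISS / EXPIRED / BYPASS / STALE (e.g. Nginx's $upstream_cache_status).
LOG_PATH = "/var/log/nginx/access.log"
STATUSES = ("HIT", "MISS", "EXPIRED", "BYPASS", "STALE")

counts = collections.Counter()
with open(LOG_PATH) as log:
    for line in log:
        for status in STATUSES:
            if f" {status} " in line:   # crude token match; adapt to your log format
                counts[status] += 1
                break

total = sum(counts.values())
if total:
    print(f"Cache HIT ratio: {counts['HIT'] / total:.1%} over {total} requests")
    print(dict(counts))
else:
    print("No cache status tokens found; check the log format.")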

However, even with the best server-side configuration, accelerating the SSL/TLS handshake, enabling OCSP Stapling, TCP BBR and TCP Fast Open, and painstakingly tuning the kernel (as we routinely do at Managed Server Srl), we inevitably run into the limit imposed by physical distance. This is especially true for sites hosted in Europe that need to interact with Google's crawlers based in Mountain View, California.

To overcome this challenge, many industry professionals are turning to Content Delivery Network (CDN) services in Platform as a Service (PaaS) mode. These solutions make it possible to keep cached copies of the site's content on nodes much closer to Mountain View, thanks to AnyCast routing.

A CDN is a network of geographically distributed servers that work together to deliver web content quickly. These servers, known as points of presence (POPs), store copies of website content. When a user or bot requests access to this content, the request is sent to the nearest POP, thus reducing the time required to transmit the data.

The origin, in the context of a CDN, is the server or group of servers from which the original content comes. These origin servers deliver content to the CDN's POPs, which in turn distribute it to end users.

AnyCast routing is a method of routing network traffic where requests from a user are routed to the closest node in terms of response time. This can help significantly reduce TTFB since data doesn't have to travel long distances before reaching the user or bot.

In the context of computer networks, Anycast is a traffic routing method that allows multiple devices to share the same IP address. It is particularly popular in Content Delivery Network (CDN) services and in the Domain Name System (DNS), as it helps reduce latency and improve network resilience.

To understand how Anycast routing works, imagine a network of servers distributed across different geographical locations around the world, each sharing the same public IP address. When a user sends a request to that IP address, the request is routed to the server "closest" to that user. Here, "closest" doesn't necessarily mean physical distance, but rather distance in terms of network hops or response time. In practice, the user connects to the node that can answer their request the fastest.

The beauty of Anycast routing is that it is completely transparent to the end user. It doesn't matter where the user is or which server answers the request: the IP address is always the same. This makes the system extremely flexible and resilient. Should a server go offline or become overloaded, traffic can easily be redirected to another server without any disruption for the user.

However, it is important to point out that “CDN” is an umbrella term that can refer to a wide variety of services, each with its own quirks. Not all CDNs are created equal, and to get the maximum benefits from a CDN, it's essential to understand its specifics and configure it correctly.

In many cases, using common commercial CDNs may not bring the expected benefits. For example, while a CDN may have POPs near Mountain View, misconfiguration or suboptimal CDN design can result in these POPs not having a cached copy of site content. In this case, requests from Google crawlers are redirected to the origin, potentially degrading response speed.

This problem can be especially serious if the origin doesn't have a local caching system capable of serving content to crawlers within milliseconds. In that case, using a CDN can end up slowing down requests rather than speeding them up, negatively impacting crawl time and, consequently, site visibility in Google search results.

The problem can be further exacerbated if the CDN is not properly configured for caching. Some CDNs offer a large number of configuration options, each of which can have a significant impact on performance. If these options are not configured correctly, the CDN may not be able to serve cached content effectively, which can lead to longer response times.
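A simple spot check is to request the same URL twice through the CDN and read the cache-status header it returns: the first request may legitimately be a MISS that warms the edge, but the second should be a HIT. Header names differ between providers (X-Cache, CF-Cache-Status, Age and so on), so the list in the sketch below is only indicative, and the URL is a placeholder.

import urllib.request

# Minimal sketch: URL is a placeholder; the header names are common examples and
# vary from one CDN provider to another.
URL = "https://www.example.com/"
CACHE_HEADERS = ("x-cache", "cf-cache-status", "x-cache-status", "age")

def cache_headers(url: str) -> dict:
    req = urllib.request.Request(url, headers={"User-Agent": "cache-probe/1.0"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return {h: resp.headers[h] for h in CACHE_HEADERS if resp.headers[h] is not None}

print("First request: ", cache_headers(URL))
print("Second request:", cache_headers(URL))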

The best way to monitor Google Crawl Stats.

Regardless of the technologies used and the vendors involved, the most effective way to monitor Google crawl stats and average response time is through the use of Google Search Console. The Crawl Stats panel provides detailed and up-to-date information about Google's crawling activity of your website. We recommend that you access this section on a weekly basis to closely monitor your site's performance.

For optimal operation, the average response time should always stay below 200 milliseconds. This value is the recommended upper bound for the time it should take Googlebot to get a response from your server, and a lower value means Googlebot can crawl your pages more efficiently, improving your crawl budget and your standing with Google. You can track your average response time in the Crawl Stats section of Google Search Console.

The 200 ms figure is not an arbitrary value: it is the threshold explicitly indicated by Google at https://developers.google.com/speed/docs/insights/Server?hl=it, and it is a best practice worth following in any case, as outlined in Google's suggestions for improving server response time.

Figure: Improve server response time (Google PageSpeed Insights)
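If you want a rough self-check between one Search Console report and the next, the sketch below measures the TTFB of a page a few times and compares the average with the 200 ms target. The URL is a placeholder, and keep in mind that a measurement taken from your own network is only a proxy for the average response time Googlebot records from its side.

import time
import urllib.request

# Minimal sketch: URL is a placeholder; TTFB here includes DNS, TCP and TLS setup,
# measured from wherever the script runs, not from Google's network.
URL = "https://www.example.com/"
SAMPLES = 10
THRESHOLD_MS = 200

ttfbs = []
for _ in range(SAMPLES):
    start = time.perf_counter()
    with urllib.request.urlopen(URL, timeout=10) as resp:
        resp.read(1)                      # wait for the first byte of the body
    ttfbs.append((time.perf_counter() - start) * 1000)
    time.sleep(1)

avg = sum(ttfbs) / len(ttfbs)
verdict = "within the target" if avg < THRESHOLD_MS else "above the 200 ms target"
print(f"TTFB avg {avg:.0f} ms (min {min(ttfbs):.0f}, max {max(ttfbs):.0f}) -> {verdict}")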

If you notice that your average response time is over 200 milliseconds, or if you're having trouble with your crawl stats, don't hesitate to contact us. Our experts are available to help you find a quick solution and improve the performance of your website. Remember, a fast and responsive website is not only beneficial for your users, it is also a key factor in good SEO.


