May 17, 2024

Why AWS is not a cure-all: the 4-hour downtime of passioneastronomia.it

Power is nothing without control (and without a Full Page Cache). Those who cause their own misfortune have only themselves to blame.

In the world of web hosting and infrastructure management, AWS (Amazon Web Services) is often considered the benchmark for scalability and reliability. However, like any powerful tool, AWS delivers only when it is properly configured and managed. This article analyzes the concrete case of an astronomy site, passioneastronomia.it, which suffered a 4-hour downtime despite being hosted on AWS. We will explore the causes of the problem, the use of CloudFront and CloudFlare, and the importance of a Full Page Cache for improving performance and reducing resource consumption.

The Case: passionastronomia.it

A few months ago, one of our valued customers, whose site serves approximately 50 million page views per month, pointed us to an up-and-coming astronomy site that periodically went down under heavy traffic.

He followed the Passione Astronomia page on Facebook and learned of the outages from its own posts: every time a post caused an overload and the site went offline, the team announced it, almost boasting that the server could not handle the traffic. For the owner of a site that handles 50 million page views per month, with peaks of over 3 million per day, it is rather "curious" that a site with roughly a tenth of that traffic, about 6 million page views per month according to SimilarWeb estimates, can go offline so often without anyone working to solve the problem.

Visits-SimilarWeb-Passion-Astronomy

This is a real problem. When a site is down and visitors either cannot reach it or pile up 500 errors, Chromium-based browsers (Chromium being the engine behind Chrome), Chrome included, report the incident to Google. Consequently, Google sends less traffic to that site in the following hours, knowing that something is not working properly. This can significantly reduce the site's visibility and performance, further penalizing its traffic and its reliability in the eyes of users and search engines, as well as damaging its image.

Furthermore, even when the site is online, its performance is unremarkable: the Time To First Byte (TTFB) is around 500 ms at 8 in the morning, when the site has few visitors and no freshly published post is driving traffic. Below is the TTFB test for Europe, run with SpeedVitals.com.

SpeedVitals-Passion-Astronomy

Below, for comparison, is the TTFB of Managedserver.it, a site we tune obsessively to obtain the best performance.

SpeedVitals ManagedServer.it

We are talking about a TTFB (Time to First Byte) that is 10 times lower, and well below the 200 ms maximum that Google considers acceptable before flagging the server response time as too high and recommending that you "Improve the server response time".

Improve-Server-Response-Time---Google-PSI
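
As a rough illustration of how such a measurement can be reproduced, the following Python sketch times the interval between opening a request and receiving the first byte of the response, then compares it with the 200 ms threshold mentioned above. The function names and the choice of the standard library's http.client are our own assumptions for the example; tools such as SpeedVitals measure from several locations and will report different figures.

```python
import http.client
import time
from urllib.parse import urlsplit

GOOGLE_TTFB_LIMIT_MS = 200.0  # threshold cited in the article


def measure_ttfb_ms(url: str, timeout: float = 10.0) -> float:
    """Milliseconds from starting a GET request to reading the first body byte.

    The timer includes connection setup (TCP and, for https, TLS).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    start = time.perf_counter()
    try:
        conn.request("GET", path, headers={"User-Agent": "ttfb-sketch"})
        response = conn.getresponse()  # returns once status line and headers arrive
        response.read(1)               # first byte of the body
        return (time.perf_counter() - start) * 1000.0
    finally:
        conn.close()


def ttfb_verdict(ttfb_ms: float, limit_ms: float = GOOGLE_TTFB_LIMIT_MS) -> str:
    """Classify a measurement against the response-time limit."""
    return "ok" if ttfb_ms <= limit_ms else "too slow"


# The two figures quoted in this article:
print(ttfb_verdict(22))   # ok
print(ttfb_verdict(500))  # too slow
```

A single sample is noisy, so in practice one would call measure_ttfb_ms several times and look at the median.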

Furthermore, such a high TTFB, together with the site becoming unreachable at certain times of day and the prolonged downtime, negatively affects the Core Web Vitals, as the Google PageSpeed Insights report below shows.

PageSpeed-Insight-Passioneastronomia.it

A high Time to First Byte (TTFB), as shown in the image with a value of 2.3 seconds, can negatively affect other Core Web Vitals metrics. A high TTFB means there is a significant delay before the browser starts receiving data from the server. This delay contributes to a high First Contentful Paint (FCP) (3.4 seconds) and a high Largest Contentful Paint (LCP) (4.7 seconds), because the initial rendering of the page is delayed. Additionally, a high TTFB can indirectly hurt Interaction to Next Paint (INP) and Cumulative Layout Shift (CLS), since slow page loading can lead to a less fluid user experience and more frequent layout shifts.

We are accustomed to managing significant traffic peaks, even over 2 million visits per hour and 200 thousand per minute, and we specialize in optimizing Core Web Vitals. We therefore contacted the owners of passioneastronomia.it and offered our services and expertise, proposing either to manage their infrastructure or to migrate the site to our high-performance hosting. Our offer was probably misunderstood and underestimated, and it was rejected bluntly and without much ceremony.

Yet our site boasts an optimal TTFB of just 22 ms in Italy and a PageSpeed score that comfortably passes the Core Web Vitals assessment, as you can see below.

PageSpeed-Insight-Managedserver.it

Given our significant portfolio of publishing clients and the numbers mentioned above, which anyone can assess impartially using the tools made available by Google, it seemed incredible that, in a period of continuous downtime, they did not take the opportunity to resolve all their problems quickly. They could probably have done so at half the cost of their current supplier, with whom they were often offline or performing poorly, as the tests above indicate.

Passioneastronomia.it and WordPress

Passioneastronomia.it is an editorial site developed in WordPress, based on PHP and MySQL. Although these technologies are now considered dated compared to more modern and asynchronous languages and technologies such as Node.js, Go, or NoSQL databases such as Cassandra, MongoDB or Redis, they continue to be widely used. PHP, for example, was developed in the 90s and has undergone numerous updates over time, but remains less performant in high-load situations than modern asynchronous languages. Similarly, MySQL, despite being a robust relational database management system, can encounter difficulties in terms of scalability and performance when handling large volumes of data and concurrent requests.

Despite these limitations, WordPress remains the most versatile and easy-to-implement content publishing system available today. Its large developer community, myriad of available plugins, and intuitive interface make it a popular choice for small blogs and large businesses alike. Even major newspapers such as Il Fatto Quotidiano use WordPress as their CMS for editorial publication, demonstrating that, with the right configuration and optimization, it is possible to obtain excellent performance even with technologies considered less modern.

Even NASA, which in the collective imagination is the highest expression of scientific outreach, uses WordPress as its CMS.

NASA Banner

 

Passioneastronomia.it was initially hosted (at least since we first heard about it, roughly a year ago) on SiteGround, a hosting service that is not always able to handle traffic of this level. Then, in April, the site moved to AWS, the default choice of the market and of many big names such as Netflix, Airbnb, Spotify, Twitch, LinkedIn, Adobe, Slack and the BBC.

  1. Netflix: Uses AWS for video streaming, data storage and analytics.
  2. Airbnb: Leverages AWS to manage its global hospitality platform and online marketplace.
  3. Spotify: Uses AWS for streaming music delivery, data analytics, and machine learning.
  4. Twitch: The Amazon-owned video streaming platform relies on AWS for live video transmission and data management.
  5. LinkedIn: Although some of its infrastructure is internal, LinkedIn uses AWS for some of its data storage and analytics services.
  6. Adobe: Offers its Creative Cloud services and other SaaS applications through AWS.
  7. Slack: Uses AWS for its messaging and collaboration platform.
  8. BBC: Uses AWS for on-demand video content delivery and data management.

Despite the power and scalability of AWS, passioneastronomia.it continued to experience downtime, culminating in an outage of over 4 hours on the evening of May 16, from 19:00 to 23:13, recorded by our Uptime Kuma monitoring tool.

Downtime-passioneastronomia.it
The screenshot highlights the significant downtime that occurred on May 16, 2024. The red section of the graph shows that the outage began at 19:00 and continued until 23:13, for a total of over 4 hours of inactivity.
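
The duration is easy to verify. The short Python sketch below is not how Uptime Kuma computes outages; it simply derives the length of the incident from the start and end times read off the dashboard, and then estimates what this single incident alone does to the availability of a 31-day month. The function names are ours.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M"


def outage_duration(start: str, end: str) -> timedelta:
    """Length of an outage given its start and end timestamps."""
    began = datetime.strptime(start, TIME_FORMAT)
    ended = datetime.strptime(end, TIME_FORMAT)
    if ended < began:
        raise ValueError("an outage cannot end before it starts")
    return ended - began


def monthly_availability_percent(downtime: timedelta, days_in_month: int = 31) -> float:
    """Availability over one month, counting only the given downtime."""
    month = timedelta(days=days_in_month)
    return 100.0 * (1.0 - downtime / month)


duration = outage_duration("2024-05-16 19:00", "2024-05-16 23:13")
hours, minutes = divmod(int(duration.total_seconds()) // 60, 60)
print(f"{hours} h {minutes} min")  # 4 h 13 min
print(f"{monthly_availability_percent(duration):.2f}%")
```

Since other outages were reported in the same period, the real monthly availability would be lower than this single-incident figure.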

The causes of Downtime

The main problem lies not in AWS as an infrastructure, but in the configuration and management of the service. AWS offers a wide range of services, including EC2, databases, load balancers, distributed file systems, and CDNs like CloudFront. However, if used inappropriately, AWS can lead to serious performance problems.

The passioneastronomia.it site did not have an adequate Full Page Cache in place: neither CloudFront, nor the Pro or Business plan of CloudFlare with HTML Cache support, nor even a "simple" Varnish. The site's response headers clearly show that HTML caching was not enabled:

HTTP/2 200
date: Thu, 16 May 2024 22:02:18 GMT
content-type: text/html; charset=UTF-8
set-cookie: AWSALBTG=gkrvkTmzuBE7uhKteG6ihiCGQH60BIdF48ki+7cvKP9ia2ltc4cAgn5dVM5l+/WaO8fbb8dzylYF1OYP7PZnmhHdLsauuVLuLntiKviIt8EAxKNbM3yBSyKqrMaGu1SXAQaPGkLnjoHwqz3OkmDHAVqvBB0V3v4d0WOcshbhqixspvTJTic=; Expires=Thu, 23 May 2024 22:02:17 GMT; Path=/
set-cookie: AWSALBTGCORS=gkrvkTmzuBE7uhKteG6ihiCGQH60BIdF48ki+7cvKP9ia2ltc4cAgn5dVM5l+/WaO8fbb8dzylYF1OYP7PZnmhHdLsauuVLuLntiKviIt8EAxKNbM3yBSyKqrMaGu1SXAQaPGkLnjoHwqz3OkmDHAVqvBB0V3v4d0WOcshbhqixspvTJTic=; Expires=Thu, 23 May 2024 22:02:17 GMT; Path=/; SameSite=None; Secure
cf-edge-cache: cache,platform=wordpress
link: <https://www.passioneastronomia.it/wp-json/>; rel="https://api.w.org/"
link: <https://www.passioneastronomia.it/wp-json/wp/v2/pages/49>; rel="alternate"; type="application/json"
link: <https://www.passioneastronomia.it/>; rel=shortlink
vary: Accept-Encoding
cf-cache-status: DYNAMIC
report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v4?s=dvOaq3CI0sDGSeXsjICpJUT0nF1zmp%2BFj1TKPfqJKvyRd%2FuybtNnkc9FKE6SLu7CldFY7brUo1HZwF1kvPoDyisecYzzZC3aI%2FlrjkKCKcs%2BD4LVRylVZMmSQViSYJRA6fO9JU1sg2ejKUk%3D"}],"group":"cf-nel","max_age":604800}
nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
server: cloudflare
cf-ray: 884ea6b518c20e4d-MXP
alt-svc: h3=":443"; ma=86400

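With the Free plan in its default configuration, CloudFlare does not cache HTML: the cf-cache-status: DYNAMIC header shows that the page was not considered eligible for caching and was fetched from the origin. The result is excessive load on the AWS servers, with every HTML request passing straight through to PHP processes and database connections.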
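
This kind of check can be automated. The following Python sketch parses raw response headers like the ones above and reports whether the HTML was served from CloudFlare's cache. The helper names are hypothetical, and the set of statuses treated as "served from cache" (HIT, STALE, UPDATING, REVALIDATED) reflects our reading of CloudFlare's documented cf-cache-status values; the cookie value in the sample is abbreviated.

```python
from typing import Dict, List

# Statuses that, as we read CloudFlare's documentation, mean the edge answered from cache.
SERVED_FROM_CACHE = {"HIT", "STALE", "UPDATING", "REVALIDATED"}


def parse_headers(raw: str) -> Dict[str, List[str]]:
    """Turn raw header text into a lowercase-name -> list-of-values mapping."""
    headers: Dict[str, List[str]] = {}
    for line in raw.strip().splitlines():
        if ":" not in line:  # skip the status line, e.g. "HTTP/2 200"
            continue
        name, _, value = line.partition(":")
        headers.setdefault(name.strip().lower(), []).append(value.strip())
    return headers


def html_cache_report(raw: str) -> dict:
    """Summarize whether an HTML response came from the CDN cache."""
    headers = parse_headers(raw)
    status = (headers.get("cf-cache-status") or [""])[0].upper()
    content_type = (headers.get("content-type") or [""])[0]
    return {
        "is_html": content_type.startswith("text/html"),
        "cf_cache_status": status or None,
        "served_from_cache": status in SERVED_FROM_CACHE,
        "sets_cookies": "set-cookie" in headers,
    }


sample = """HTTP/2 200
content-type: text/html; charset=UTF-8
set-cookie: AWSALBTG=abbreviated; Path=/
cf-cache-status: DYNAMIC
server: cloudflare"""

report = html_cache_report(sample)
print(report["cf_cache_status"], report["served_from_cache"])  # DYNAMIC False
```

Note that the load balancer cookies (set-cookie: AWSALBTG...) are also worth flagging, because many caches refuse by default to store responses that set cookies.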

Two failure modes follow. In the first case, when the load increases sharply, the origin web server reaches saturation and stops responding altogether, causing the reverse proxy of the CloudFlare CDN to time out, as in the following image:

CloudFlare-Gateway-Timeout

In the second case, even if the web server manages to respond for a few moments, the volume of SQL queries overwhelms MySQL and WordPress returns a database connection error, as in the screenshot below.

Error-Database-WordPress

Put briefly, and without going too deep into the world of DBMSs, table management, queries and indexes, this lack of adequate caching has several negative implications, especially for high-traffic sites such as passioneastronomia.it.

Implications of HTML Cache Lack

  1. Increased load on Origin Servers: Without HTML caching, each HTML page request must be processed directly by the origin servers. This means that each visit to the site activates PHP processes to generate the dynamic content of the page, significantly increasing the load on the servers. As a result, the server must handle a large number of concurrent requests, leading to rapid exhaustion of available resources.
  2. Database overload: Each HTML request can involve a series of database queries to retrieve the data needed to generate the page. As traffic increases, the number of database connections can exceed the manageable limit, causing the database to slow down or even crash. This database overload is often the root cause of prolonged downtime.
  3. High Response Times: Without caching, the time required to generate an HTML page can be significantly longer. Each request requires loading and executing PHP scripts, interacting with the database, and dynamically generating content. This process is much slower than serving a cached page, leading to high response times and a suboptimal user experience.
  4. Risk of Downtime: When origin servers are constantly under pressure due to the high number of requests, the risk of downtime increases. Servers can become unresponsive or even crash if the load exceeds their capacity. This is exactly what happened to passioneastronomia.it, where the site experienced significant downtime due to the inability to handle high traffic without a proper caching system.
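
To put rough numbers on the first and fourth points of the list above, here is a back-of-the-envelope model in Python. It assumes a fixed pool of PHP workers, each busy for the whole time a page takes to generate, so the maximum sustainable rate is the number of workers divided by the generation time. The pool of 20 workers, the peak of 300 requests per second and the 95% cache hit ratio are hypothetical figures chosen purely for illustration; only the 0.5 s generation time echoes the TTFB of around 500 ms measured earlier.

```python
def max_uncached_rps(php_workers: int, seconds_per_page: float) -> float:
    """Requests per second a worker pool can sustain when every page is generated."""
    if seconds_per_page <= 0:
        raise ValueError("seconds_per_page must be positive")
    return php_workers / seconds_per_page


def origin_rps(incoming_rps: float, cache_hit_ratio: float) -> float:
    """Requests per second that still reach the origin behind a cache."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return incoming_rps * (1.0 - cache_hit_ratio)


# Hypothetical figures (see the text above): 20 workers, 0.5 s per page.
capacity = max_uncached_rps(php_workers=20, seconds_per_page=0.5)  # 40 req/s
peak = 300.0  # hypothetical spike after a social media post

without_cache = origin_rps(peak, cache_hit_ratio=0.0)
with_cache = origin_rps(peak, cache_hit_ratio=0.95)

print(f"capacity:      {capacity:.0f} req/s")
print(f"without cache: {without_cache:.0f} req/s, overloaded: {without_cache > capacity}")
print(f"with cache:    {with_cache:.0f} req/s, overloaded: {with_cache > capacity}")
```

Under these assumptions the uncached spike is several times what the origin can serve, while the same spike behind a cache leaves the origin well within its capacity.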

Inappropriate Use of AWS EC2

Although AWS EC2 is a powerful solution, using an EC2 instance, even an extra large one, without proper configuration can be ineffective. An EC2 instance is nothing more than a virtual machine with a certain number of cores and amount of RAM, which offers significant computational resources. However, without proper traffic management and caching, these resources can quickly become overloaded. Poor configuration can turn a powerful instance into a bottleneck, unable to handle high traffic spikes.

For example, in the absence of load balancing mechanisms, a single server can become overwhelmed with too many concurrent requests, rapidly exhausting available CPU and memory. Furthermore, without an effective caching strategy, each user request must be processed completely by the origin server, involving PHP processes and database queries to generate the dynamic pages. This not only increases response times but also puts a strain on the database, which can become a critical point of failure even though, as we can see, it is hosted on a managed database service such as RDS.

WordPress-Error-Database

Amazon RDS (Relational Database Service) for MySQL is a managed database service that makes it easy to set up, operate, and scale a MySQL database in the cloud. With RDS, Amazon takes care of administrative tasks such as hardware provisioning, operating system and database patching, and backups. This allows users to focus more on application development and less on infrastructure management.

Even the best EC2 instances can therefore fail if they are not supported by a solid backend architecture. To maximize the efficiency of an EC2 instance, it is essential to implement application-level caching systems and use CDNs like CloudFront or CloudFlare to distribute the load. Additionally, configuring Auto Scaling to automatically adapt to traffic changes can prevent sudden overloads. Only with careful management and optimal configuration can an EC2 instance fully exploit its potential, ensuring high performance and reliability.

What are CloudFront and CloudFlare?

CloudFront

CloudFront is a CDN (Content Delivery Network) offered by AWS that delivers content around the world, reducing latency and improving page loading speed. This service is particularly useful for high-traffic sites such as passioneastronomia.it, as it is capable of caching entire HTML pages, reducing the load on origin servers and significantly improving the user experience. Delivering content via a global network of edge nodes brings data closer to end users, resulting in faster response times and greater reliability. Additionally, CloudFront offers advanced features like DDoS protection and integration with AWS Shield and AWS WAF, which help improve site security. Using CloudFront also allows you to better manage traffic spikes, ensuring that origin server resources are not overloaded during periods of high demand. Its flexibility and scalability make it an ideal solution for improving site performance and offering a smoother and faster browsing experience to users.

CloudFlare

CloudFlare is another CDN service that offers a wide range of features to improve website performance and security. Unlike CloudFront, CloudFlare offers several caching options, including HTML Cache on the Pro and Business plans.

The benefits of CloudFront and CloudFlare are very similar, if not overlapping. CloudFlare's best-known advantage over CloudFront is its flat pricing: a basic plan of $25 per month would have been more than enough for passioneastronomia.it to withstand the traffic peaks it receives on a daily basis.

CloudFront, on the other hand, is billed on a pay-as-you-go basis, so the cost is directly proportional to the data transferred.
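
The trade-off between the two pricing models can be sketched with simple arithmetic. In the Python example below, the flat fee of $25 per month comes from the text above, while the per-gigabyte price of $0.085 is an assumed, illustrative figure and not a quote of current CloudFront pricing; request fees, free tiers and regional differences are deliberately ignored.

```python
FLAT_FEE_USD = 25.0            # flat monthly plan mentioned in the article
ASSUMED_USD_PER_GB = 0.085     # hypothetical pay-as-you-go price, for illustration only


def pay_as_you_go_cost(gb_transferred: float, usd_per_gb: float = ASSUMED_USD_PER_GB) -> float:
    """Monthly cost when billing is proportional to the data transferred."""
    return gb_transferred * usd_per_gb


def break_even_gb(flat_fee_usd: float = FLAT_FEE_USD, usd_per_gb: float = ASSUMED_USD_PER_GB) -> float:
    """Monthly traffic above which the flat plan becomes the cheaper option."""
    if usd_per_gb <= 0:
        raise ValueError("usd_per_gb must be positive")
    return flat_fee_usd / usd_per_gb


print(f"break-even: {break_even_gb():.0f} GB per month")
for gb in (100, 500, 2000):
    print(f"{gb:>5} GB: pay-as-you-go ${pay_as_you_go_cost(gb):.2f} vs flat ${FLAT_FEE_USD:.2f}")
```

Below the break-even volume the pay-as-you-go model costs less; above it, the flat plan wins, which is why predictable pricing tends to favor sites with large and spiky traffic.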

In fact, both CloudFlare and CloudFront offer a Full Page Cache service in PaaS (Platform as a Service) mode, distinguishing themselves from Self Hosted solutions such as Varnish Cache. This difference is significant, as a PaaS service like the one provided by CloudFlare and CloudFront eliminates the need for you to manage and maintain caching infrastructure. With PaaS solutions, cache deployment and management are automated and integrated into the service, reducing administrative overhead and allowing teams to focus on other critical tasks.

For example, CloudFlare and CloudFront offer user-friendly interfaces for configuring caching rules, managing invalidations and monitoring site performance, all without requiring in-depth server management skills. In contrast, Self Hosted solutions like Varnish Cache require manual configuration and ongoing management of the caching infrastructure. This means having qualified technical personnel capable of installing, configuring, monitoring and updating the caching software, as well as managing the physical or virtual servers on which it runs.

PaaS solutions from CloudFlare and CloudFront also offer a significant advantage in terms of scalability and reliability. Being cloud services, they can leverage the global network of distributed nodes to deliver high performance and low latency to end users, regardless of their geographic location. Delivering content across a global network of edge servers reduces latency and improves page load times, providing an optimal user experience.

Furthermore, CloudFlare and CloudFront integrate advanced security features, such as DDoS protection and integration with SSL certificates, which can be managed and configured directly from the platform without requiring manual intervention. This level of automation and integration makes it easier to secure web applications and provides greater peace of mind for site owners.

While Self Hosted solutions like Varnish Cache can offer more granular and customized control over cache configuration, they require a significant commitment in terms of resources and technical expertise. On the other hand, CloudFlare and CloudFront PaaS solutions provide a complete, automated and highly scalable service that simplifies cache management, reduces administrative burden and offers a higher level of performance and security.

What is Full Page Cache?

Full Page Cache (FPC) is a caching technique that stores the entire HTML page generated by a website. This technique is essential for high-traffic sites as it drastically reduces the number of requests to the origin server, reducing the workload of PHP processes and database connections.

How it works

When a user visits a web page, the server generates the HTML page and caches it. Subsequent visits to the same page are served directly from the cache, avoiding page regeneration and reducing response times.
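
The mechanism just described can be sketched in a few lines of Python. This is a minimal in-memory illustration under our own assumptions, not the implementation used by CloudFlare, CloudFront or Varnish: the class name, the 60-second TTL and the rule of bypassing the cache for logged-in users are hypothetical choices made for the example.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class FullPageCache:
    """Stores rendered HTML per path and serves it until the TTL expires."""

    def __init__(self, render: Callable[[str], str], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._render = render          # the expensive step (PHP + database in WordPress)
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}  # path -> (expiry, html)
        self.hits = 0
        self.misses = 0

    def get(self, path: str, logged_in: bool = False) -> str:
        if logged_in:
            # Personalized pages must not be shared between visitors.
            return self._render(path)
        now = self._clock()
        entry = self._store.get(path)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]            # served from cache, origin untouched
        self.misses += 1
        html = self._render(path)      # generate once...
        self._store[path] = (now + self._ttl, html)  # ...and keep it for later visits
        return html

    def purge(self, path: Optional[str] = None) -> None:
        """Invalidate one page (e.g. after an edit) or the whole cache."""
        if path is None:
            self._store.clear()
        else:
            self._store.pop(path, None)


render_calls = []


def slow_render(path: str) -> str:
    render_calls.append(path)          # stands in for PHP execution and SQL queries
    return f"<html><body>Page {path}</body></html>"


cache = FullPageCache(slow_render, ttl_seconds=60)
for _ in range(1000):
    cache.get("/")
print(len(render_calls))  # 1
```

A thousand visits to the home page trigger a single page generation; the other 999 never reach the rendering code, which is exactly the load that would otherwise land on PHP and MySQL.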

Conclusion

The lack of an HTML cache is a classic example of how the absence of an adequate caching strategy can put a strain on even the most robust infrastructures like AWS. For sites with high traffic, implementing an efficient caching system is crucial to reduce the load on origin servers, improve response times and prevent significant downtime. In this specific case, it would have been enough to use the $25 monthly version of CloudFlare, configured by a good systems engineer, to solve the downtime problems. Even better, implementing a Varnish Cache system with CloudFlare placed in front would have allowed for a double layer of cache. This configuration would have offered maximum performance at a cost that was certainly lower than the current one. Using such solutions, it is possible to achieve significant improvements in site performance and stability, ensuring an optimal user experience even during peak traffic.

This shows that beyond the provider of the services you decide to use, what really matters is the technical competence and expertise of the technicians who are responsible for implementing the best caching strategy in order to improve performance and content delivery.

Below is the traffic of the client who pointed Passioneastronomia.it out to us, a site whose problems we could have solved in 1 hour of work and consultancy, probably saving around 75% of the budget now being spent on AWS only to go offline. But, as the saying goes, those who cause their own misfortune have only themselves to blame.

JetPack statistics Visitor counter

It will be amusing when the "bills" arrive from AWS, and in particular from Amazon RDS for MySQL, which is a pay-per-use service and will probably increase costs significantly. Systems engineering is not child's play and, as you can see, inexperienced technical consultants can not only cause damage but also drain your budget, unaware of how much more cost-effective it is to invest in a CDN and a Full Page Cache rather than in managed database services like AWS RDS.

We told them so…

Updates two months later.

Following this post and the reflections above, we felt it was only right (although not required) to share some technical suggestions with the Passione Astronomia company. We therefore sent an email with precise and practical technical advice for their current hosting scenario on AWS, recommending that they either put a Full Page Cache such as Amazon's CloudFront in front of the site, or enable HTML caching on the CloudFlare account they are already using in Free mode, perhaps directly through CloudFlare's APO at a low cost of around 5 dollars per month.

We confidently expected to see our technical suggestions implemented within a few days. However, after almost two months, the analysis of the response headers shows no changes: no trace of CloudFront, no trace of HTML caching on CloudFlare, and a response time that still exceeds the 200 ms maximum tolerated by Google.

Header-HTTP-Passioneastronomia

The acid test came with the downtime of 8 July 2024, that is, yesterday evening, when our Uptime Kuma monitoring system reported a prolonged outage of the site in question: it was completely offline for 2 hours and 20 minutes, as we can see from the following dashboard.

Downtime-PassioneAstronomia-8-July-2024

We were quite shocked by the fact that our advice, completely disinterested and based on years of experience, was neither received nor implemented. It is particularly surprising considering that some of the largest and most important publishing companies on the Italian scene are willing to pay several thousand euros to solve problems of stability, speed and performance that we are able to deal with effectively.

This leads us to a legitimate reflection: are the operators in the sector, as well as our possible "colleagues", really aware of the cornerstones necessary for the management of an important and successful project like this? Perhaps the time has come to think seriously about establishing a professional register with a state exam to qualify for professions such as systems engineering. Such a step would guarantee high standards of competence and responsibility, ensuring that only the most trained professionals can operate in a sector so crucial to the success of modern businesses.


INFORMATION

Managed Server Srl is a leading Italian player in providing advanced GNU/Linux system solutions oriented towards high performance. With a low-cost and predictable subscription model, we ensure that our customers have access to advanced technologies in hosting, dedicated servers and cloud services. In addition to this, we offer systems consultancy on Linux systems and specialized maintenance in DBMS, IT Security, Cloud and much more. We stand out for our expertise in hosting leading Open Source CMS such as WordPress, WooCommerce, Drupal, Prestashop, Joomla, OpenCart and Magento, supported by a high-level support and consultancy service suitable for Public Administration, SMEs and organizations of any size.
