Many AIs do not respect the directives of the robots.txt file. The unauthorized use of web content by AI companies. - 🏆 Managed Server

June 25 2024

Unauthorized use of web content by AI companies threatens the media industry's revenue and can overload web servers to the point of crashing them.

The rapid development of artificial intelligence (AI) has opened new frontiers in information processing, but has also raised significant ethical and legal questions. Recently, it has emerged that several AI companies are ignoring web standards for content acquisition, such as the “robots.txt” protocol, raising concerns among publishers and digital content experts. This article will explore the implications of these practices, analyzing the consequences for the media industry and discussing possible solutions.

Context and meaning of the “robots.txt” protocol

The “robots.txt” protocol was introduced in the 1990s to let website owners control which parts of their site search engine crawlers may access. It has since become a mainstay for keeping websites from being overloaded with automated requests while protecting the rights of content owners.

The robots.txt Directives and the Crawl Delay

The “robots.txt” file does more than indicate which pages a bot may and may not visit. Many sites also use it to set a “Crawl-delay” directive, which specifies how many seconds a bot should wait between one request and the next. Crawl-delay is a widely used extension rather than part of the formal standard (RFC 9309), and not every crawler honors it, but where it is respected it helps prevent a site from being flooded with requests that would sharply increase CPU load and consume server resources.
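To make the directives concrete, here is a minimal sketch of how a well-behaved crawler could read them, using the robots.txt parser from Python's standard library. The bot names (ExampleAIBot, ExampleSearchBot), the paths and the 10-second delay are invented for illustration and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; bot names, paths and delay are illustrative.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: ExampleSearchBot
Crawl-delay: 10
Disallow: /private/
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# ExampleAIBot is blocked from the whole site.
print(rules.can_fetch("ExampleAIBot", "https://example.com/articles/1"))      # False
# ExampleSearchBot may read articles but not the private area.
print(rules.can_fetch("ExampleSearchBot", "https://example.com/articles/1"))  # True
print(rules.can_fetch("ExampleSearchBot", "https://example.com/private/x"))   # False
# Seconds the bot is asked to wait between requests (None if not set).
print(rules.crawl_delay("ExampleSearchBot"))                                  # 10
```

A compliant crawler would call can_fetch before every request and sleep for the crawl delay between requests. Nothing in the file forces a bot to do so, which is exactly the weakness discussed in the rest of this article.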

The problem of AI companies ignoring directives

Many AI companies do not comply with these directives, causing a significant increase in load on web servers. The problem is especially acute for large sites with hundreds of thousands of pages or products. When several bots, both traditional search crawlers and AI bots, crawl a site at the same time, CPU load can rise sharply and reach unsustainable levels. The load on the database also increases dramatically, as continuous queries exhaust its resources. PHP processes, often used to generate dynamic content, can slow down or even crash, making the situation worse.

Case Study: Real Impact on Server Resources

A practical example involves one of our customers, whose site came under heavy load because more than eight emerging AI bots were crawling it simultaneously. The bots kept crawling for over eight hours, pushing CPU load more than 900% above (roughly ten times) the normal levels of the previous months. This overload slowed the site down and risked causing a complete crash.
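When a site is under this kind of pressure, a first diagnostic step is to see which user agents generate the most requests. The following is a minimal sketch that assumes the web server writes its access log in the common "combined" format, where the user agent is the last quoted field. The sample log lines and bot names are invented for illustration and are not taken from the customer case above.

```python
import re
from collections import Counter

# Combined log format ends with: "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$')


def requests_per_agent(lines):
    """Count requests per user-agent string; unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line.strip())
        if match:
            counts[match.group("agent")] += 1
    return counts


# Hypothetical log excerpt (documentation IP addresses, invented bot names).
SAMPLE_LOG = [
    '203.0.113.5 - - [25/Jun/2024:10:00:00 +0000] "GET /p/1 HTTP/1.1" 200 512 "-" "ExampleAIBot/1.0"',
    '203.0.113.5 - - [25/Jun/2024:10:00:00 +0000] "GET /p/2 HTTP/1.1" 200 498 "-" "ExampleAIBot/1.0"',
    '198.51.100.7 - - [25/Jun/2024:10:00:01 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [25/Jun/2024:10:00:01 +0000] "GET /p/3 HTTP/1.1" 200 530 "-" "ExampleAIBot/1.0"',
]

counts = requests_per_agent(SAMPLE_LOG)
for agent, hits in counts.most_common():
    print(f"{hits:5d}  {agent}")
```

On a real server the lines would be read from the access log file instead of a list. Keep in mind that user-agent strings are self-declared, so a bot that wants to hide can simply present itself as a browser.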

The Perplexity case and publishers' response

An emblematic example of this problem is the conflict between Forbes and Perplexity, an AI search startup that generates automatic summaries of web content. Forbes has publicly accused Perplexity of using its investigative articles to generate AI summaries without permission, getting around the restrictions set in its “robots.txt” file. An investigation by Wired found that Perplexity was likely ignoring the protocol to circumvent these blocks.

The case has alarmed the News Media Alliance, a trade group representing more than 2,200 publishers in the United States. Its president, Danielle Coffey, warned that failing to stop these practices could seriously undermine the media industry's ability to monetize its content and pay journalists.

The role of TollBit

One response to these problems comes from TollBit, a startup that positions itself as an intermediary between AI companies and publishers. TollBit monitors AI traffic on publisher websites and uses analytics to help both parties negotiate licensing fees for content usage.

TollBit reported that Perplexity is not alone: numerous AI agents are bypassing the “robots.txt” protocol. Data the company collected from multiple publishers shows a clear pattern of violations by different AI sources, which points to a widespread problem in the industry.
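A publisher can run a basic version of this check itself by comparing the requests recorded in its logs with its own robots.txt rules. The sketch below is only an illustration of the idea and is not TollBit's method. The rules, bot names and observed requests are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules published by the site.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

# Hypothetical (user agent, requested path) pairs extracted from access logs.
OBSERVED = [
    ("ExampleAIBot/1.0", "/articles/1"),
    ("ExampleAIBot/1.0", "/articles/2"),
    ("ExampleSearchBot/2.1", "/articles/1"),
    ("ExampleSearchBot/2.1", "/private/report"),
]


def find_violations(robots_txt, observed):
    """Return the observed requests that the robots.txt rules disallow."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        (agent, path)
        for agent, path in observed
        if not parser.can_fetch(agent, path)
    ]


violations = find_violations(ROBOTS_TXT, OBSERVED)
for agent, path in violations:
    print(f"{agent} requested disallowed path {path}")
```

This only catches bots that identify themselves honestly. A crawler that sends a browser-like user agent will not be flagged, so results of this kind are a lower bound on the real number of violations.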

The legal implications and future prospects

The “robots.txt” protocol has no clear legal enforcement mechanism: compliance is voluntary, which makes it harder for publishers to defend themselves against these practices. However, there are signs that some groups, such as the News Media Alliance, are exploring possible legal action to protect their rights.

Meanwhile, some publishers are taking different approaches. For example, The New York Times has taken legal action against AI companies for copyright infringement, while others are signing licensing deals with AI companies willing to pay for content. However, there is still wide disagreement about the value of materials provided by publishers.

Conclusion

The unauthorized use of web content by AI companies represents a significant problem for the media industry. As AI technologies continue to evolve, it is crucial to establish a balance that protects the rights of content creators while ensuring technological innovation. Initiatives like TollBit's and possible legal action could be important steps towards a fair solution, but dialogue between the parties involved remains essential to building a sustainable future for all.
