25th June 2024

Many AI crawlers do not respect the directives of the robots.txt file: the unauthorized use of web content by AI companies

The unauthorized use of web content by AI companies threatens the media industry, and aggressive crawling can overload server resources to the point of crashes.

The rapid development of artificial intelligence (AI) has opened new frontiers in information processing, but has also raised significant ethical and legal questions. Recently, it has emerged that several AI companies are ignoring web standards for content acquisition, such as the “robots.txt” protocol, raising concerns among publishers and digital content experts. This article will explore the implications of these practices, analyzing the consequences for the media industry and discussing possible solutions.

Context and meaning of the “robots.txt” protocol

The “robots.txt” protocol, formally the Robots Exclusion Protocol, was introduced in 1994 to let website owners tell automated crawlers, originally search engine crawlers, which parts of a site they may and may not access. It has since become a mainstay for keeping websites from being overloaded with automated requests, while protecting the rights of content owners.

The robots.txt Directives and the Crawl Delay

The “robots.txt” file not only indicates which pages a bot may and may not visit, but also supports additional directives such as “Crawl-delay”. This parameter specifies the minimum time, in seconds, that a bot should wait between one request to the server and the next. Although “Crawl-delay” is a non-standard extension that not every crawler honors, it matters because it helps prevent a website from being flooded with requests, which could sharply increase CPU load and the consumption of other server resources.
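To make these directives concrete, here is a minimal sketch of how a well-behaved crawler could read them, using Python's standard `urllib.robotparser` module. The robots.txt content, the bot names `ExampleAIBot` and `SomeOtherBot`, and the example.com URLs are assumptions made up for illustration; they do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /cart/
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# ExampleAIBot is disallowed everywhere by its own rule group.
blocked = rules.can_fetch("ExampleAIBot", "https://example.com/products/1")

# Any other bot falls back to the "*" group: products allowed, cart not.
allowed = rules.can_fetch("SomeOtherBot", "https://example.com/products/1")
cart = rules.can_fetch("SomeOtherBot", "https://example.com/cart/checkout")

# Seconds a compliant bot should wait between requests (None if unset).
delay = rules.crawl_delay("SomeOtherBot")

print(blocked, allowed, cart, delay)
```

A compliant crawler would skip any URL for which `can_fetch` returns `False` and sleep for `delay` seconds between requests. Nothing in the file forces this behavior: it only works if the crawler chooses to check.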

The problem of AI companies ignoring directives

Many AI companies do not comply with these directives, which significantly increases the load on website servers. The problem is especially acute for large sites with hundreds of thousands of pages or products. When several bots, both legitimate crawlers and AI bots, crawl a site simultaneously, CPU load can climb steeply and reach unsustainable levels. The database also comes under heavy pressure, as a continuous stream of queries exhausts its resources. PHP processes, often used to generate dynamic content, can slow down or even crash, making the situation worse.

Case Study: Real Impact on Server Resources

A practical example of this issue concerns one of our customers, whose site came under heavy load when more than eight emerging AI bots crawled it simultaneously. The bots kept crawling for over eight hours, pushing CPU load more than 900% above the normal levels of the preceding months, that is, to more than ten times the usual baseline. This overload slowed the site down and risked causing a complete crash.
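A situation like this usually becomes visible in the web server's access log first. The sketch below is one possible way to tally requests per AI bot, assuming a log in the common "combined" format, where the user agent is the last quoted field. The marker list and the sample log lines are illustrative assumptions, not data from the customer described above.

```python
import re
from collections import Counter

# Illustrative user-agent substrings; adjust to the bots seen in your own logs.
AI_BOT_MARKERS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Bytespider")

# In the combined log format, the user agent is the last double-quoted field.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')


def count_ai_bot_requests(log_lines):
    """Count requests per known AI bot marker found in the user agent."""
    counts = Counter()
    for line in log_lines:
        match = USER_AGENT_RE.search(line)
        if not match:
            continue  # line does not end with a quoted user-agent field
        user_agent = match.group(1).lower()
        for marker in AI_BOT_MARKERS:
            if marker.lower() in user_agent:
                counts[marker] += 1
                break  # attribute each request to at most one bot
    return counts


# Made-up sample lines, for illustration only.
sample_log = [
    '203.0.113.5 - - [25/Jun/2024:10:00:00 +0000] "GET /p/1 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [25/Jun/2024:10:00:01 +0000] "GET /p/2 HTTP/1.1" 200 498 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '203.0.113.5 - - [25/Jun/2024:10:00:02 +0000] "GET /p/3 HTTP/1.1" 200 530 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [25/Jun/2024:10:00:03 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"',
]

counts = count_ai_bot_requests(sample_log)
for bot, hits in counts.most_common():
    print(f"{bot}: {hits} requests")
```

User-agent strings can be spoofed, so a tally like this only counts bots that identify themselves. It is a starting point for spotting abnormal crawl volume, not proof of which company is behind the traffic.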

The Perplexity case and publishers' response

A prominent example of this problem is the dispute between Forbes and Perplexity, an AI search startup that develops tools for generating automatic summaries. Forbes publicly accused Perplexity of using its investigative articles to generate AI summaries without permission, bypassing the restrictions set out in “robots.txt”. An investigation by Wired found that Perplexity was likely ignoring the protocol to get around these blocks.

This case has raised significant alarm at the News Media Alliance, a trade group representing more than 2,200 publishers in the United States. Its president, Danielle Coffey, warned that failing to stop these practices could seriously undermine the media industry's ability to monetize its content and pay journalists.

The role of TollBit

In response to these problems, a startup called TollBit has emerged, positioning itself as an intermediary between AI companies and publishers. TollBit monitors AI traffic on publisher websites and uses analytics to help both parties negotiate licensing fees for content usage.

TollBit reported that Perplexity is not alone: numerous AI agents are bypassing the “robots.txt” protocol. Data the company has collected from multiple publishers shows a clear pattern of protocol violations by different AI sources, which points to a widespread problem in the industry.

The legal implications and future prospects

The “robots.txt” protocol has no clear legal enforcement mechanism, which complicates publishers' ability to defend against these practices. However, there are signs that some groups, such as the News Media Alliance, are exploring possible legal action to protect their rights.

Meanwhile, some publishers are taking different approaches. For example, The New York Times has taken legal action against AI companies for copyright infringement, while others are signing licensing deals with AI companies willing to pay for content. However, there is still wide disagreement about the value of materials provided by publishers.

Conclusion

The unauthorized use of web content by AI companies represents a significant problem for the media industry. As AI technologies continue to evolve, it is crucial to establish a balance that protects the rights of content creators while ensuring technological innovation. Initiatives like TollBit's and possible legal action could be important steps towards a fair solution, but dialogue between the parties involved remains essential to building a sustainable future for all.
