The rapid development of artificial intelligence (AI) has opened new frontiers in information processing, but it has also raised significant ethical and legal questions. It has recently emerged that several AI companies are ignoring the web standards that govern automated content collection, such as the “robots.txt” protocol, raising concerns among publishers and digital content experts. This article explores the implications of these practices, analyzing the consequences for the media industry and discussing possible solutions.
Context and meaning of the “robots.txt” protocol
The “robots.txt” protocol was introduced in 1994 to allow website owners to control which parts of their site could be crawled by search engine bots. The standard has become a mainstay for ensuring that web servers are not overloaded with automated requests, while protecting the rights of content owners.
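For reference, robots.txt is a plain-text file served at the root of a site. A minimal example might look like the snippet below; the paths and bot name are purely illustrative, not taken from any real site:

```
# Allow all crawlers, but keep them out of a hypothetical /private/ area
User-agent: *
Disallow: /private/

# Block a specific (illustrative) AI crawler from the entire site
User-agent: ExampleAIBot
Disallow: /
```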
The robots.txt directives and the crawl delay
The “robots.txt” file not only indicates which pages a bot may and may not visit, but also supports directives such as “Crawl-delay”. Crawl-delay specifies the minimum time, in seconds, that a bot should wait between one request to the server and the next. Although it is an unofficial extension of the standard, honored by some crawlers and not others, this directive plays an important role in preventing a website from being flooded with requests, which could cause a significant increase in CPU load and server resource usage.
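As a concrete illustration, Python’s standard library includes a robots.txt parser that a well-behaved crawler could use to check both the access rules and any Crawl-delay value before fetching a page. This is only a sketch: the URL, page path, and user-agent string are placeholders, not real services.

```python
# Sketch: consult robots.txt before fetching a page.
# "ExampleAIBot" and the example.com URLs are placeholders.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()  # download and parse the robots.txt file

user_agent = "ExampleAIBot"
page = "https://example.com/products/widget-42"

if robots.can_fetch(user_agent, page):
    delay = robots.crawl_delay(user_agent)  # None if no Crawl-delay is set
    print(f"Allowed to fetch {page}; Crawl-delay is {delay or 'not set'}")
else:
    print(f"robots.txt disallows {page} for {user_agent}")
```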
The problem of AI companies ignoring directives
Many AI companies do not comply with these guidelines, causing a significant increase in load on website servers. The problem is especially acute for large sites with hundreds of thousands of pages or products. When several bots, both legitimate crawlers and AI agents, crawl a site at the same time, CPU load can climb rapidly to unsustainable levels. The load on the database also increases sharply, as a continuous stream of queries saturates its resources, and the PHP processes often used to generate dynamic content can slow down or even crash, making the situation worse.
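By contrast, a crawler that wants to stay within a site’s rules checks robots.txt once and then pauses between requests. The sketch below is a minimal illustration under assumed names (the user agent, URLs, and fallback delay are placeholders), not a description of how any particular bot actually works:

```python
# Minimal sketch of a throttled, robots.txt-aware crawl loop.
# A real crawler would also need error handling, per-host politeness, caching, etc.
import time
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleAIBot"   # placeholder user agent
DEFAULT_DELAY = 5             # fallback seconds if no Crawl-delay is given

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()
delay = robots.crawl_delay(USER_AGENT) or DEFAULT_DELAY

urls = [
    "https://example.com/products/page-1",
    "https://example.com/products/page-2",
]

for url in urls:
    if not robots.can_fetch(USER_AGENT, url):
        continue  # skip anything the site has disallowed
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request) as response:
        body = response.read()
    time.sleep(delay)  # spread requests out so the server is not overloaded
```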
Case Study: Real Impact on Server Resources
A practical example of this issue involves one of our customers, whose site was crawled simultaneously by more than eight emerging AI bots. The bots continued to crawl the site for over eight hours, driving CPU load more than 900% above its normal level of the preceding months. The overload slowed the site down noticeably and risked causing a complete outage.
The Perplexity case and publishers' response
An emblematic example of this problem is the conflict between Forbes and Perplexity, an AI search startup that develops tools for generating automatic summaries. Forbes has publicly accused Perplexity of using its investigative articles to generate AI summaries without permission, bypassing the restrictions imposed through the “robots.txt” protocol. An investigation by Wired found that Perplexity was likely circumventing the protocol in order to get around those blocks.
This case has raised alarm within the News Media Alliance, a trade group representing more than 2,200 publishers in the United States. Its president, Danielle Coffey, has warned that failure to stop these practices could seriously undermine the media industry’s ability to monetize its content and pay journalists.
The role of TollBit
In response to these problems, a startup called TollBit has emerged, positioning itself as an intermediary between AI companies and publishers. TollBit monitors AI traffic on publisher websites and uses these analytics to help both parties negotiate licensing fees for the use of content.
TollBit has reported that not only Perplexity but numerous other AI agents are bypassing the “robots.txt” protocol. The company has collected data from multiple publishers showing a clear pattern of violations by different AI sources, indicating that the problem is widespread across the industry.
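TollBit has not publicly detailed its methodology, but at a basic level, detecting this kind of traffic starts from a publisher’s own access logs. The sketch below is a rough illustration of that idea, not TollBit’s actual system: the log path, log format, and list of user-agent substrings are assumptions made for the example.

```python
# Rough sketch: count access-log requests from suspected AI crawler user agents.
# Log path, log format, and the user-agent markers are illustrative assumptions.
from collections import Counter

AI_AGENT_MARKERS = ["GPTBot", "CCBot", "PerplexityBot", "ExampleAIBot"]
LOG_PATH = "/var/log/nginx/access.log"  # assumed combined log format

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        for marker in AI_AGENT_MARKERS:
            if marker in line:
                hits[marker] += 1
                break

for agent, count in hits.most_common():
    print(f"{agent}: {count} requests")
```

Note that bots which ignore robots.txt may also disguise or omit their User-Agent, so log analysis of this kind gives an indication rather than a definitive picture.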
The legal implications and future prospects
The “robots.txt” protocol has no clear legal enforcement mechanism, which complicates publishers' ability to defend against these practices. However, there are signs that some groups, such as the News Media Alliance, are exploring possible legal action to protect their rights.
Meanwhile, some publishers are taking different approaches. For example, The New York Times has taken legal action against AI companies for copyright infringement, while others are signing licensing deals with AI companies willing to pay for content. However, there is still wide disagreement about the value of materials provided by publishers.
Conclusion
The unauthorized use of web content by AI companies represents a significant problem for the media industry. As AI technologies continue to evolve, it is crucial to establish a balance that protects the rights of content creators while ensuring technological innovation. Initiatives like TollBit's and possible legal action could be important steps towards a fair solution, but dialogue between the parties involved remains essential to building a sustainable future for all.