Cloudflare introduces new feature to block AI bots scraping website content | Turtles AI
Highlights:
- New Cloudflare Feature: Block AI bots with a single click.
- Issues with robots.txt: Often ignored by AI bots.
- Technology Used: Machine learning and digital fingerprinting.
- Future Concerns: Continuous adaptation by AI companies to evade blocks.
Cloudflare has launched a new option for its web hosting customers to block AI bots from collecting website data without authorization, addressing widespread customer concern about the unauthorized use of their content to train machine learning models. The feature, which can be enabled with a single click, aims to "preserve a safe internet for content creators," as the company put it.
Currently, the most widely used method to block bots is the robots.txt file, placed in the root directory of a website, which well-behaved automated crawlers are expected to read and comply with. Compliance has become a contentious issue: many believe that generative AI is built on "stolen" content, and numerous lawsuits are underway to hold AI companies accountable.
Last August, OpenAI published guidelines on how to block its GPTBot crawler using the robots.txt file, and Google followed suit the following month. In September of the same year, Cloudflare began offering a way to block rule-respecting AI bots, and 85% of customers enabled the option, according to the company.
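For reference, opting out of these crawlers in robots.txt looks like the following. GPTBot is the user-agent token OpenAI documents for its crawler, and Google-Extended is Google's token for controlling AI training use; the directives shown are a minimal illustrative example, not a complete policy.

```
# Ask OpenAI's crawler not to fetch any page for training
User-agent: GPTBot
Disallow: /

# Opt the site out of use by Google's AI models
User-agent: Google-Extended
Disallow: /
```

As the article notes, these directives are advisory: a crawler that ignores them faces no technical barrier, which is the gap Cloudflare's new feature is meant to close.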
Cloudflare observed that AI bots visit about 39% of the top websites managed by its platform, raising concerns among webmasters. Unfortunately, the robots.txt file, like the Do Not Track header in browsers, can be ignored without serious consequences. Recent reports suggest that AI bots often disregard these directives.
For instance, last week Amazon stated it was investigating evidence that bots used by Perplexity, an AWS client, had crawled websites, including news sites, reproducing their content without proper attribution or permission. Perplexity CEO Aravind Srinivas denied that his company was deliberately ignoring the robots.txt file but admitted that third-party bots used by Perplexity might have violated the rules.
Cloudflare stated it had observed bot operators attempting to disguise themselves as real browsers by using spoofed user agents. Their machine learning-based scoring system consistently rated these bots below 30, indicating they are "likely automated."
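The core idea behind such a scoring system can be sketched in a few lines. This is a toy heuristic, not Cloudflare's actual model (which scores requests with machine learning over many signals); the header checks below are hypothetical examples of cross-checking a claimed user agent against other request traits.

```python
# Illustrative sketch only: a spoofed user agent can be caught by checking
# whether the rest of the request looks like the browser it claims to be.

def bot_score(headers: dict) -> int:
    """Return a score from 1 (almost certainly a bot) to 99 (likely human).

    Assumption (hypothetical example): real Chrome browsers send client-hint
    headers such as `sec-ch-ua`, so a request whose User-Agent claims Chrome
    but lacks them is suspicious.
    """
    ua = headers.get("user-agent", "").lower()
    if not ua:
        return 1  # no user agent at all: almost certainly automated
    score = 50  # neutral starting point
    if "chrome" in ua and "sec-ch-ua" not in headers:
        score -= 30  # claims Chrome but missing Chrome's client hints
    if "accept-language" not in headers:
        score -= 15  # browsers virtually always send a language preference
    return max(1, min(99, score))

# A request spoofing a Chrome user agent, with no browser-typical headers,
# lands below 30 -- the "likely automated" band mentioned in the article:
spoofed = {"user-agent": "Mozilla/5.0 ... Chrome/120.0"}
print(bot_score(spoofed))  # prints 5
```

In practice a real system would weigh dozens of such signals statistically rather than with hand-written rules, but the principle is the same: the claimed identity must be consistent with everything else observable about the request.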
This new feature relies on digital fingerprinting, a technique more commonly used to track people online at the expense of their privacy. Crawlers, like human visitors, stand out through technical details visible in their network interactions. With a network handling an average of 57 million requests per second, Cloudflare has ample data to determine which fingerprints can be trusted.
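A fingerprint of this kind can be thought of as a stable hash over the technical traits a client cannot easily vary. The sketch below is an assumption-laden illustration: the attribute names (`tls_version`, `header_order`) are hypothetical, and Cloudflare's real fingerprinting covers far more signals.

```python
# Illustrative sketch: derive a compact "fingerprint" by hashing stable
# technical details of a client's network behavior. Attribute names are
# hypothetical examples, not Cloudflare's actual signal set.
import hashlib

def fingerprint(attrs: dict) -> str:
    """Hash a set of request attributes into a short, order-independent ID."""
    material = "|".join(f"{k}={attrs[k]}" for k in sorted(attrs))
    return hashlib.sha256(material.encode()).hexdigest()[:16]

# Two requests with identical technical traits map to the same fingerprint,
# even if they present themselves differently elsewhere (e.g. the user agent):
a = fingerprint({"tls_version": "1.3", "header_order": "host,ua,accept"})
b = fingerprint({"header_order": "host,ua,accept", "tls_version": "1.3"})
assert a == b  # dict key order does not matter; the traits do
```

Seen at scale, fingerprints that consistently belong to known crawlers can then be blocked regardless of what user agent they present.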
The defense against bots seeking to feed AI models is now available even to Cloudflare's free-tier customers. To activate it, click the "Block AI Scrapers and Crawlers" toggle in the Security -> Bots menu for your website.
Cloudflare acknowledged that some AI companies will likely keep adapting to evade detection. The company promised to continue monitoring, to add further blocks for AI bots, and to evolve its machine learning models, with the goal of keeping the internet a safe place for content creators and giving them full control over which models may use their content for training or inference.