The AI race for glory and… mostly money is speeding up.
Practically every week we’re flooded with new models and tools, and with even newer ones constantly being announced, it’s easy to start doubting whether we’ll be able to find our way around the market at all.
Whether you’re an AI evangelist or you hate it, you have to admit that data absorption at this scale causes problems, especially around intellectual property rights and the way such data is accessed.
One of the loudest companies trying to slow down this massive web data extraction is Cloudflare — a company focused on providing fast, secure, and reliable content delivery and DDoS protection, while also offering anti-bot solutions to defend websites from unwanted traffic.
Some time ago I wrote an article about Cloudflare’s AI Labyrinth, which was their first big announcement on how to fight back against big AI companies.
Now they’ve announced they’re rolling out another big gun.
Blocking bots by default
First of all, Cloudflare declared July 1st as “Content Independence Day” and started blocking bots by default for sites using its protection.
In short, it means that more websites will be protected from bots.
But does it change a lot? I don’t think so. The websites that suffer most from being “harassed” by AI companies are usually big, well-maintained services. And those services, if they use Cloudflare, had this bot protection turned on long before.
Bot Access Management
A much more interesting part of this news is that Cloudflare has promised full control over permissions for bots based on robots.txt.
It’s definitely a step in the right direction, because restricting bot access to all pages makes no sense. You might not want AI to train on your research papers and recreate your graphic designs for other users, but I guess if someone asks AI about the “best expert in [your niche]”, you’d like to be mentioned. So it’s a good idea to leave some marketing materials for AI to look through, am I right? :)
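This split maps naturally onto robots.txt. A hypothetical policy that keeps AI training crawlers out of the valuable parts of a site while leaving the marketing pages open could look like this (GPTBot and Google-Extended are real AI-crawler user agents; the paths are made up for illustration):

```txt
# Keep OpenAI's training crawler away from research and design assets,
# but let it read the marketing pages.
User-agent: GPTBot
Disallow: /papers/
Disallow: /designs/
Allow: /about/
Allow: /services/

# Google-Extended controls AI training use without affecting Search.
User-agent: Google-Extended
Disallow: /
```

Cloudflare’s promise here is essentially a managed UI on top of exactly this kind of per-bot policy.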
Pay Per Crawl
And that’s the best part! And the main reason for all of this buzz.
Cloudflare decided to allow website owners to monetize content by requiring AI bots to pay for accessing it.
People are mad at AI training on their data mostly because they’re not getting paid for it. And they’re 100% right. So Cloudflare thought that people might be more willing to let AI read their content if they get paid.
It would mean that AI companies would still be able to train models on new data, website owners would get another source of income, and Cloudflare would have everything under control.
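Mechanically, Cloudflare described Pay Per Crawl as a negotiation over HTTP 402 Payment Required: a crawler that receives a 402 with the site’s asking price can retry, declaring the maximum it is willing to pay. The sketch below models only the crawler-side decision; the `crawler-price` and `crawler-max-price` header names come from Cloudflare’s announcement, while the function and return values are my own illustration.

```python
def crawl_decision(status, response_headers, max_price_usd):
    """Decide how a crawler reacts to a Pay Per Crawl response.

    Returns an (action, headers) pair: either give up, or retry the
    request declaring the price we are willing to pay.
    """
    if status != 402:
        return "fetched", {}
    # The 402 response advertises the site's asking price per request.
    asking = float(response_headers.get("crawler-price", "inf"))
    if asking > max_price_usd:
        return "skip", {}  # content costs more than our per-URL budget
    # Retry, committing to pay up to the asking price.
    return "retry", {"crawler-max-price": f"{asking:.4f}"}
```

For example, a 402 quoting $0.01 against a $0.05 budget yields a retry with `crawler-max-price: 0.0100`, while a $0.10 quote gets skipped.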
But how is it really going to work? And how will Cloudflare know that this bot is from OpenAI, Google, or ByteDance?
Message Signatures — Verified Bots
Cloudflare has introduced a new approach to bot authentication with the integration of HTTP Message Signatures into its Verified Bots Program — a program for massive data scrapers to register themselves officially on Cloudflare resources.
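Under HTTP Message Signatures (RFC 9421), the bot builds a canonical “signature base” out of selected request components and signs it, sending the result in `Signature` and `Signature-Input` headers. Cloudflare’s verified bots sign with Ed25519 key pairs; since Python’s standard library has no Ed25519, the sketch below substitutes HMAC purely to make the example self-contained — the signature-base shape follows RFC 9421, the key and values are made up.

```python
import base64
import hashlib
import hmac


def signature_base(method, authority, path, created, keyid):
    # Simplified RFC 9421 signature base: each covered component on its
    # own line, terminated by the @signature-params pseudo-component.
    params = f'("@method" "@authority" "@path");created={created};keyid="{keyid}"'
    lines = [
        f'"@method": {method}',
        f'"@authority": {authority}',
        f'"@path": {path}',
        f'"@signature-params": {params}',
    ]
    return "\n".join(lines), params


def sign_request(method, authority, path, created, keyid, secret):
    base, params = signature_base(method, authority, path, created, keyid)
    # Real verified bots sign with an Ed25519 private key whose public half
    # is discoverable by Cloudflare; HMAC-SHA256 stands in here only so the
    # example runs on the standard library.
    mac = hmac.new(secret, base.encode(), hashlib.sha256).digest()
    return {
        "Signature-Input": f"sig1={params}",
        "Signature": f"sig1=:{base64.b64encode(mac).decode()}:",
    }


headers = sign_request("GET", "example.com", "/post.html",
                       1735689600, "test-key", b"shared-secret")
```

The point of the scheme: a spoofer can copy a user agent, but without the bot operator’s private key it cannot produce a valid signature over its own requests.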
Is it a game-changer? Well, kind of.
How bots are identified
How should we understand “identifying a bot”? In this section, I mean determining that particular requests are coming from, e.g., OpenAI’s servers, and not Anthropic’s or xAI’s.
Usually, bots are identified by two things:
- User agent
- IP address
But both identification methods have flaws, even when combined.
Spoofing a user agent is super easy: you just specify it in the HTTP headers.
```python
import requests

url = "https://example.com"
# Pretend to be Google's crawler by overriding the User-Agent header
headers = {
    "User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"
}
response = requests.get(url, headers=headers)
```
Years ago this worked most of the time; you could even access paywall-protected content on big journal websites. But the problem was quickly addressed by adding another layer of verification: the IP address.
And while spoofing a user agent is easy, you basically can’t spoof an IP address belonging to Google or Microsoft.
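How does IP verification work in practice? Google, for example, documents forward-confirmed reverse DNS as the way to verify Googlebot: do a reverse DNS lookup on the requesting IP, check that the hostname belongs to googlebot.com or google.com, then resolve that hostname forward and confirm it points back to the same IP. A minimal sketch (the helper names are mine):

```python
import socket

# Google's published crawler domains for Googlebot verification
GOOGLE_DOMAINS = (".googlebot.com", ".google.com")


def hostname_is_google(hostname: str) -> bool:
    # A crawler hostname counts as Google's only if it ends with one of
    # the published crawler domains (a strict suffix, so that
    # "googlebot.com.attacker.net" does not pass).
    return hostname.rstrip(".").endswith(GOOGLE_DOMAINS)


def verify_googlebot(ip: str) -> bool:
    """Forward-confirmed reverse DNS check (requires network access)."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
    except socket.herror:
        return False
    if not hostname_is_google(hostname):
        return False
    # Forward lookup: the hostname must resolve back to the same IP,
    # otherwise anyone could fake a PTR record for their own address.
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
```

The forward step is what makes this robust: you control the reverse record for your own IP, but not the forward records inside google.com.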
So if IP identification is that robust, why are those “message signatures” needed?
Back then, there were only a few bots happily accepted on every site — Googlebot, bots of other search engines (Microsoft, Yahoo, etc.), and a few more.
Now there are more acceptable search engines, many more internet crawling companies with a positive impact, and — more importantly — a lot of AI companies.
So now it’s almost impossible to keep records of which IPs belong to whom. And it’s even harder if we want to handle it efficiently at a huge scale.
To address that, Cloudflare came up with the idea that bot identification should be handled not by them, the hosting provider, or the website owner, but by the company operating the bot. It’s definitely a cheaper way of getting things done, but bot creators have to feel the NEED to be identified.
Can Cloudflare really fight bots?
There are a few anti-bot systems out there: Cloudflare, Google reCAPTCHA, hCaptcha, DataDome, AWS WAF, etc.
And from my scraping experience, Cloudflare is not the hardest to overcome. That doesn’t mean it’s super easy to scrape Cloudflare-protected websites at high scale; it requires experience in bypassing those systems and using the proper tools. So if your bot was already able to scrape Cloudflare-protected sites, it probably won’t be blocked by these new systems either.
At the same time, even if Cloudflare is not the hardest anti-bot system for smaller scrapers to get through, it does look able to detect bots from AI companies with a high success rate. I really believe so, because those companies stand out from the crowd through their sheer scale.
Will bots from AI companies get through Pay Per Crawl?
And here comes another issue — if a small web scraping company with low resources is able to bypass Cloudflare bot protection, for big companies like OpenAI it should be easy.
Well, yes and no.
Bypassing anti-bot systems means slowing down scraping and increasing costs. That’s not a problem if you have 10k, 100k, or even 1 million URLs to scrape, but at a bigger scale it becomes a serious problem, maybe one too hard to overcome, especially with a budget in mind.
Another issue is the legal side. Automated data collection from some sites is forbidden by their terms of use. In the USA, scraping anything that isn’t behind authentication should generally be okay, but it can still cause legal problems.
Summary
Cloudflare made a great move towards earning on the AI boom. If their plan works out, they’ll benefit from AI companies’ hunger for new data while letting content creators earn additional money. But in the end, it’s the market that will decide how this plays out.
I feel like this new ecosystem around blocking bots is really aimed at fighting the big players like ByteDance, OpenAI, or Anthropic. And let’s be honest, those companies sending millions of requests per second to websites was definitely an issue. There’s a chance these new strategies won’t affect smaller scrapers that are willing to maintain infrastructure capable of scraping even protected websites.
I’m curious to see how it will affect the AI industry and if AI companies will really be paying for data access.
Sources:
- https://blog.cloudflare.com/content-independence-day-no-ai-crawl-without-compensation/
- https://www.cloudflare.com/paypercrawl-signup/
- https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click/
- https://blog.cloudflare.com/control-content-use-for-ai-training/
- https://blog.cloudflare.com/introducing-pay-per-crawl/
- https://blog.cloudflare.com/verified-bots-with-cryptography/
Thanks for reading! Kamil Kwapisz