The Closing Web: AI Crawler Blocking and Agent Access

Ryan Turner · Head of Innovation · June 5, 2026

The web that was open to anonymous crawlers is closing. Default blocking and paid-access marketplaces are replacing the old free-for-all. As a result, agent access now splits into two paths: licensed or paid crawl where it exists, or arriving as a real user the rest of the time. If your agent still assumes it can fetch any public URL from a datacenter IP, it is building on ground that is disappearing under it.

Key Takeaways

On July 1, 2025, Cloudflare began blocking AI crawlers by default across roughly 20% of the web and launched a pay-per-crawl marketplace (Cloudflare, Cloudflare Just Changed How AI Crawlers Scrape the Internet-at-Large).
Major news sites have moved to default-deny: ~79% block AI training bots, ~49% disallow GPTBot by name.
The trigger is economics: crawl-to-referral ratios reached ~38,000:1 for one major crawler. Sites have their content taken and get almost no traffic in return.
Training crawlers and real-time agent retrieval get caught in the same nets. The agents that keep working look like real users in the right geo, or pay for licensed access.

What changed: the web went default-deny

In 2025, the defaults flipped. The biggest single event was Cloudflare, which on July 1 began blocking AI crawlers by default across roughly 20% of the web and launched a pay-per-crawl marketplace (Cloudflare, Cloudflare Just Changed How AI Crawlers Scrape the Internet-at-Large). Pay-per-crawl is a marketplace where a site charges bots for access it used to give away for free. In effect, one config change moved a fifth of the web from opt-out to opt-in.

That was not a niche policy shift. Bots are no longer a minority of traffic. In 2024, automated bots crossed 51% of all web traffic for the first time in a decade, with bad bots at 37% (Imperva, 2025 Bad Bot Report). When most requests hitting your origin are machines, blocking machines by default stops looking aggressive. Instead, it starts looking like basic hygiene.

The news industry moved first and hardest. By 2025, roughly 79% of the world's biggest news websites blocked AI training bots, and around 49% disallowed GPTBot by name (Press Gazette, Eight in ten of world's biggest news websites now block AI training bots). As a result, robots.txt went from a polite suggestion to a default-deny posture for the AI category. The open-crawl path did not end overnight. Nevertheless, the trend line is clear, and it points in one direction.
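To make "disallowed GPTBot by name" concrete, here is a minimal sketch using Python's standard-library robots.txt parser. The robots.txt content and the `ExampleBrowser` agent name are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: GPTBot is denied by name,
# every other user agent is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# GPTBot matches the named group and is denied; an unnamed agent falls
# through to the wildcard group and is allowed.
gptbot_ok = is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/article")
other_ok = is_allowed(ROBOTS_TXT, "ExampleBrowser", "https://example.com/article")
```

Note that robots.txt only states a policy. Enforcement happens elsewhere, at the network edge.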

Why it happened: the crawl-to-referral collapse

The reason is economics, not ideology. The old bargain was simple. Crawlers indexed your content, and search sent you visitors in return. AI crawling broke that loop. In mid-2025, Anthropic's crawler hit roughly 38,000 pages per referred visitor, and OpenAI's GPTBot ran about 3,700:1 (Cloudflare, The crawl before the fall of referrals). Consequently, publishers do the math and see content leaving with almost nothing flowing back.
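The arithmetic is worth spelling out. The sketch below converts a crawl-to-referral ratio into referred visits per million pages crawled, using the two ratios cited above. The function names are ours, for illustration only.

```python
def crawl_to_referral_ratio(pages_crawled: int, referred_visits: int) -> float:
    """Pages crawled per referred visitor."""
    if referred_visits <= 0:
        raise ValueError("referred_visits must be positive")
    return pages_crawled / referred_visits


def referrals_per_million_crawls(ratio: float) -> float:
    """Referred visits a site can expect per 1,000,000 pages crawled."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1_000_000 / ratio


# Ratios cited above: ~38,000:1 and ~3,700:1.
high = referrals_per_million_crawls(38_000)  # about 26 visits
low = referrals_per_million_crawls(3_700)    # about 270 visits
```

At 38,000:1, a million crawled pages return roughly 26 visitors. That is the figure publishers weigh against their bandwidth bill.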

It gets sharper when you look at what the crawling is for. AI crawling splits roughly 80% training, 18% search, and only 2% user actions (Cloudflare, A deeper look at AI crawlers). Four-fifths of it feeds model training, which sends back zero referrals by design. From a site owner's chair, therefore, that is pure extraction, and blocking is the rational response.

The volume is climbing too, which raises the stakes. AI and search crawler traffic rose 18% year over year into 2025. Over the same period, GPTBot's share of AI-crawler requests jumped from 5% to 30%, and its raw request volume grew 305% (Cloudflare, From Googlebot to GPTBot: who's crawling your site in 2025). With more load, no return traffic, and easy tooling to block it, default-deny was inevitable.

What it means for agents: caught in the same net

Here is the trap that catches engineering teams. Training crawlers and real-time agent retrieval are different things. A training crawler scrapes millions of pages to build a dataset. Your agent, in contrast, fetches three pages to answer one user's question right now. However, the site does not see intent. It sees an automated request from a known bot user-agent or a flagged IP range, and it applies the same default-deny rule to both.

That is why "the web is closing to AI" hits agents that never touch training data. The blocking infrastructure does not distinguish a retrieval agent from a scraper. Instead, it distinguishes humans from bots, and increasingly it distinguishes known-good IP space from datacenter ranges. In short, an honest agent on a cloud IP looks identical to a hostile scraper.

Datacenter IPs are addresses owned by cloud and hosting providers. Anti-bot systems flag these ranges first because no ordinary person browses from them, and in 2026 that is the core reason agents fail on protected targets. We cover the mechanics in why agents get blocked on datacenter IPs.
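A rough sketch of why intent never enters the decision: the toy rule below looks only at the user-agent string and the source IP. It is a simplified illustration, not how any particular vendor's detection works. The bot tokens and CIDR blocks are placeholders, and the CIDRs are reserved documentation ranges (RFC 5737) standing in for real provider ranges.

```python
import ipaddress

# Placeholder inputs for illustration only. "gptbot" is the crawler named in
# this article; "examplecrawler" is hypothetical. The networks are RFC 5737
# documentation ranges, not real cloud-provider address space.
KNOWN_BOT_TOKENS = ("gptbot", "examplecrawler")
DATACENTER_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def edge_decision(user_agent: str, client_ip: str) -> str:
    """Decide from user-agent and IP alone; request purpose is never an input."""
    ua = user_agent.lower()
    if any(token in ua for token in KNOWN_BOT_TOKENS):
        return "block: declared bot"
    addr = ipaddress.ip_address(client_ip)
    if any(addr in net for net in DATACENTER_RANGES):
        return "block: datacenter ip"
    return "allow"


BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"

# A training crawler that declares itself is blocked by name.
crawler = edge_decision("GPTBot/1.0", "203.0.113.10")
# A three-page retrieval agent with a browser-like UA is still blocked,
# because its request leaves from a flagged range.
agent_on_cloud = edge_decision(BROWSER_UA, "198.51.100.7")
# The same request from an address outside the flagged ranges passes.
agent_elsewhere = edge_decision(BROWSER_UA, "192.0.2.44")
```

Nothing in `edge_decision` can tell a three-page lookup from a million-page scrape. Both arrive as the same two fields.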

So the access question splits two ways, and both have a place. Where a licensed or paid path exists, such as a pay-per-crawl deal or an official API, take it. It is the cleanest option, and it survives the closing web by definition. Everywhere else, the durable path is to arrive as a real user: a request that originates from a residential or mobile device in the geo the content expects, rendering the page the way a person's browser would. Residential proxies are connections that route through real consumer devices, so the request carries an ISP-assigned address a site treats as an ordinary visitor. The choice between those network types is its own decision, which we break down in residential vs datacenter proxies.
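The two-path split can be sketched as a small routing step in front of your fetch layer. Everything named here is a hypothetical placeholder: the `LICENSED_HOSTS` registry, the host names, and the path labels. A real system would fill that registry from its own licensing and API agreements.

```python
from enum import Enum
from urllib.parse import urlparse


class AccessPath(Enum):
    LICENSED = "licensed"    # pay-per-crawl deal or official API
    REAL_USER = "real_user"  # residential or mobile egress with rendering


# Hypothetical registry of hosts where a licensed or paid path exists.
LICENSED_HOSTS = {
    "api.example-publisher.com": "official API",
    "news.example-licensed.com": "pay-per-crawl",
}


def choose_path(url: str) -> AccessPath:
    """Prefer the licensed path when one is registered for the host."""
    host = (urlparse(url).hostname or "").lower()
    if host in LICENSED_HOSTS:
        return AccessPath.LICENSED
    return AccessPath.REAL_USER


licensed = choose_path("https://api.example-publisher.com/v1/articles/42")
fallback = choose_path("https://blog.example.org/post/closing-web")
```

Keeping the decision in one function makes it cheap to move a host from one path to the other when a site adds or withdraws a paid option.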

This is the part most teams underestimate until it breaks production. As the open-crawl path closes, the agents that keep working are the ones that do not look like crawlers at all. In our experience across agent workloads, real-user-device access, arriving as an organic local visitor with clean rendering, is what stays reliable when default-deny is the norm. That is the positioning behind Massive's device-access network plus rendering stack: real consumer devices across 195+ countries with country, subdivision, and city geotargeting, returning clean HTML or markdown from any public source in any location. From our work with teams, we see them bring it in as a fallback for the targets that broke, then move it to primary once the ticket queue disappears. When the DIY proxy-plus-headless-browser stack stops paying for itself, the next step is usually managed infrastructure, which we get into in managed browser infrastructure.

For the full architecture of giving an agent durable live access, start from the pillar on how to give AI agents live web access. This trend is one input into that design, not the whole story.

What to do now: build for the closing web

Plan as if default-deny is the baseline, because in 2025 it became one. Cloudflare put roughly 20% of the web behind opt-in access in a single move (Cloudflare, Cloudflare Just Changed How AI Crawlers Scrape the Internet-at-Large), and adoption only grows from there. Therefore, design your access layer assuming the easy targets will harden, not assuming today's open URLs stay open.

Three practical moves follow from the data. First, separate your targets into "licensed/paid path available" and "must arrive as a real user," then route each accordingly. Second, stop sending agent traffic from raw cloud IPs, since the detection edge flags them before your request body is even read. Third, prefer clean markdown or HTML output over raw page dumps, because your LLM pays for every token of clutter you feed it. On the second point, we tested residential against datacenter egress on protected sites, and residential success rates landed far higher (rough ranges: residential ~85-99%, datacenter ~20-40%). Treat that as a vendor benchmark, not independent research. That said, the direction matches what the detection trend predicts.
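For the third move, here is a minimal standard-library sketch of stripping page clutter before text reaches the model. It assumes a simple rule: drop the contents of script, style, and navigation-type tags, and keep the remaining text. A production pipeline would likely use a dedicated extraction or markdown-conversion tool. Character count stands in for token count here.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as clutter in this sketch (an assumption,
# tune the set for your own targets).
SKIP_TAGS = {"script", "style", "nav", "footer", "noscript"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the page's visible text, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


SAMPLE = (
    "<html><head><style>p{color:red}</style><script>track()</script></head>"
    "<body><nav>Home | About</nav><h1>Title</h1><p>Body text.</p>"
    "<footer>Legal</footer></body></html>"
)
clean = html_to_text(SAMPLE)
saved_chars = len(SAMPLE) - len(clean)
```

On this toy page, the cleaned text is a small fraction of the raw markup, and that gap is what you would otherwise pay for in tokens.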

Sources

Imperva, 2025 Bad Bot Report, 2025. https://www.imperva.com/resources/resource-library/reports/2025-bad-bot-report/
Cloudflare, Cloudflare Just Changed How AI Crawlers Scrape the Internet-at-Large, 2025. https://www.cloudflare.com/press/press-releases/2025/cloudflare-just-changed-how-ai-crawlers-scrape-the-internet-at-large/
Cloudflare, The crawl before the fall of referrals, 2025. https://blog.cloudflare.com/crawlers-click-ai-bots-training/
Cloudflare, A deeper look at AI crawlers, 2025. https://blog.cloudflare.com/ai-crawler-traffic-by-purpose-and-industry/
Cloudflare, From Googlebot to GPTBot: who's crawling your site in 2025, 2025. https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-crawling-your-site-in-2025/
Press Gazette, Eight in ten of world's biggest news websites now block AI training bots, 2025. https://pressgazette.co.uk/platforms/eight-in-ten-of-worlds-biggest-news-websites-now-block-ai-training-bots/

Frequently Asked Questions

Is the open web actually closing, or is this hype?

The defaults changed, which is the part that matters. In 2025, Cloudflare moved ~20% of the web to block AI crawlers by default, and ~79% of major news sites now block AI training bots (Cloudflare; Press Gazette). Open URLs still exist. However, default-deny is now the trend, not the exception.

My agent only retrieves a few pages, not training data. Why is it blocked?

Because the blocking infrastructure cannot see intent. It flags bot user-agents and datacenter IP ranges, and it applies the same rule to a three-page retrieval agent and a million-page training crawler. AI crawling is roughly 80% training (Cloudflare). Consequently, sites default to denying the whole category.

Why are publishers blocking instead of just charging?

Both, increasingly. The trigger is the crawl-to-referral collapse: one major crawler hit ~38,000 pages crawled per referred visitor in 2025 (Cloudflare). Pay-per-crawl marketplaces, meanwhile, let sites charge for access they used to give away, which is the paid half of the new split.

What is the durable access path for agents now?

Two paths. Where licensed or paid access exists, use it. Everywhere else, arrive as a real user: a request from a residential or mobile device in the expected geo, with clean rendering. That avoids the datacenter-IP flag that catches most agents on protected sites.