Robots.txt Is Not Blocking AI Scrapers. Here Is What Actually Does.

Ian O'Reilly · 9 min read

You have probably heard the advice. Drop a few lines into your robots.txt file disallowing GPTBot, ClaudeBot, and PerplexityBot, and the WordPress site you spent years building is supposedly off-limits to the bots scraping content for AI training. Job done.

It does not work that way. It probably never did. Cloudflare confirmed it last summer in writing, and the picture in 2026 is worse than most business owners realise. Robots.txt is a polite request. It is not a control. The bots that matter most have learned to ignore it.

The Honour System That Was Always Optional

Robots.txt is a plain text file at the root of your website. When a crawler arrives, it is supposed to fetch that file and respect whatever you have disallowed. Googlebot does. Bing's crawler does. Most well-behaved indexers do.

Here is the catch: nothing in your hosting stack enforces it. Robots.txt is a convention from 1994. It has no teeth. A crawler that wants your content has zero technical obligation to fetch the file, let alone obey it.
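To make the honour system concrete, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt content, the function names, and the example URL are illustrative assumptions, not taken from any real site. The point is that the check lives inside the crawler, so it only happens if the crawler chooses to run it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that disallows three AI crawlers by name.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def polite_crawler_may_fetch(user_agent: str, url: str) -> bool:
    """What a cooperative crawler does: parse the rules and check itself."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


def impolite_crawler_may_fetch(user_agent: str, url: str) -> bool:
    """What an uncooperative crawler does: it never consults the file."""
    return True
```

A cooperative crawler that runs polite_crawler_may_fetch("GPTBot", "https://example.com/about") gets False and stays away. An uncooperative one simply skips the call. Nothing on the server notices the difference.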

For thirty years, that hardly mattered. Most crawler traffic came from search engines that played by the rules because the arrangement was mutual: they wanted your pages in their index, and you wanted the traffic that index sent back. The arrival of AI training crawlers changed the deal. A training crawler has no index to list you in. It just needs your text.

What Cloudflare's Perplexity Investigation Actually Showed

In August 2025, Cloudflare published an investigation that should have been the moment every site owner stopped treating robots.txt as a defence [1]. Their data showed Perplexity operating two crawlers in parallel.

The first was the declared PerplexityBot. Sites that wanted to block it could, and aggressive operators did. The second was a stealth crawler. It identified itself as a generic Chrome browser on macOS. It rotated through IPs outside Perplexity's published range. It changed source ASNs in response to blocks. Cloudflare measured 3 to 6 million daily requests from this second crawler, on top of 20 to 25 million from the declared one.
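This is why blocking by name fails against a crawler that hides its name. The sketch below is a hypothetical user-agent blocklist of the kind many guides recommend. The token list and both user-agent strings are illustrative assumptions, not copied from Cloudflare's report.

```python
# Hypothetical blocklist of crawler name tokens, matched case-insensitively.
BLOCKED_AGENT_TOKENS = ("gptbot", "claudebot", "perplexitybot")

# Illustrative user-agent strings (assumptions for this example only).
DECLARED_BOT_UA = "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://example.com/bot)"
GENERIC_CHROME_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def blocked_by_user_agent(user_agent: str) -> bool:
    """Return True if the user-agent contains any blocked crawler token."""
    lowered = user_agent.lower()
    return any(token in lowered for token in BLOCKED_AGENT_TOKENS)
```

The declared string is caught. The generic Chrome string is not, and it is indistinguishable from a real visitor on a Mac. Any rule keyed on the name alone has the same blind spot.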

Perplexity disputed the framing, but the behaviour was documented and reproducible on freshly created test domains.

Two things matter here for a small business website owner. First, this was not a hypothetical edge case. It was operating at scale across the Cloudflare network. Second, Cloudflare's separate analysis showed the share of bots ignoring robots.txt jumped from roughly 3.3% to 12.9% over the previous year [2]. That near four-fold increase is not just Perplexity. It is the new normal.

The Hosting Bill Side Effect Most Owners Miss

This is where it becomes a business problem rather than a curiosity. AI training bots fetch heavily and refer almost nothing back. Cloudflare's July 2025 measurement put ClaudeBot's crawl-to-referral ratio at roughly 38,000 page requests for every single human visitor referred [2]. GPTBot came in at roughly 1,091 to 1. PerplexityBot was the only one with anything resembling a useful ratio, at roughly 195 to 1.
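The ratio itself is simple arithmetic: pages a crawler fetched, divided by the human visits the same platform referred back. A minimal sketch, with a hypothetical function name and made-up sample counts chosen only to reproduce the scale Cloudflare reported:

```python
def crawl_to_referral_ratio(crawler_requests: int, referred_visits: int) -> float:
    """Pages fetched by a crawler per human visit its platform sent back."""
    if referred_visits <= 0:
        # No referrals at all: the ratio is unbounded.
        return float("inf")
    return crawler_requests / referred_visits


# Made-up monthly counts for a single site, for illustration only.
example_ratio = crawl_to_referral_ratio(crawler_requests=76_000, referred_visits=2)
```

With these invented counts the result is 38,000 to 1. If your analytics show crawler requests in the tens of thousands and referrals in single digits, you are looking at the same shape of problem.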

In practical terms: the bots scraping your content for training are consuming serious bandwidth and CPU on your hosting plan, and almost none of them are sending you customers in return. That is real money on the wrong side of the ledger.

In the edge logs we reviewed across managed WordPress sites this month, the pattern is consistent. Three or four AI bots in rotation, hitting every page of a small business site multiple times a day, ignoring whatever directive sits at /robots.txt.

A typical case looks like this. A Limerick accountancy firm watches their site slow down every evening between 7pm and 11pm. The owner suspects a plugin issue. It is not. Three different AI bots are taking turns fetching every page of the site in a continuous rotation. Their robots.txt file blocks all three by name. None of them are honouring it. The slow pages are not the cause of a problem. They are the symptom of one the owner cannot see from the WordPress dashboard.
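You can see this pattern yourself if your host gives you raw access logs. The sketch below assumes the widely used "combined" log format (the Nginx default). The sample lines, the IP addresses, and the ExampleAIBot name are invented for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Matches one line of the "combined" access log format.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Invented sample lines; a real log would have thousands of these.
SAMPLE_LOG = [
    '203.0.113.7 - - [12/Mar/2026:19:05:01 +0000] "GET /services HTTP/1.1" 200 5120 "-" "ExampleAIBot/1.0"',
    '203.0.113.7 - - [12/Mar/2026:19:05:02 +0000] "GET /contact HTTP/1.1" 200 4096 "-" "ExampleAIBot/1.0"',
    '198.51.100.4 - - [12/Mar/2026:20:15:40 +0000] "GET / HTTP/1.1" 200 8192 "-" "Mozilla/5.0"',
    'this line is malformed and will be skipped',
]


def requests_per_agent_per_hour(lines):
    """Count requests keyed by (hour of day, user-agent string)."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # Ignore lines that are not in the expected format.
        stamp = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
        counts[(stamp.hour, match["agent"])] += 1
    return counts
```

Run against an evening's log, a table like this shows which agents are active between 7pm and 11pm and how hard they are fetching. Grouping by user-agent only finds bots that announce themselves. To catch the ones posing as browsers, group by IP address instead.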

[Image: The defence has to live at the server layer, not in a text file.]

Why Robots.txt Still Sits in Every Recommendation List

If it does not work, why does every WordPress security guide still tell you to add the disallow lines? Two reasons.

The first is genuine. The well-behaved bots do still honour it. If you do not want OpenAI training on your content, and the official GPTBot is the only thing hitting your site, robots.txt does the job. For a thoughtful publisher who actively wants partial cooperation with AI, robots.txt remains the right tool.

The second reason is less generous. Most of the advice was written before the Cloudflare data was published. It has not been updated. The recommendation is correct in principle and increasingly wrong in practice. The infrastructure question matters because aggressive crawler load does not just affect one defence layer. It affects the whole performance stack that keeps a WordPress site fast.

What Actually Works at the Infrastructure Level

The defence has to move from the polite-request layer to the server layer. Three things genuinely help.

Behaviour-based rate limiting. A crawler that requests forty pages per second from a single IP looks nothing like a human visitor. Server-level rules that throttle or block by request pattern catch aggressive bots regardless of what user-agent they present. Nginx and similar edge layers support this natively, provided your hosting provider has configured it.

HTTP 429 responses on shared infrastructure. IONOS stated publicly in 2025 that they return 429 Too Many Requests to training bots on shared hosting plans because the load was degrading performance for legitimate customers. That kind of provider-level response is far more useful than a robots.txt file the bot will never read.

Edge challenge layers. Cloudflare rolled out an AI Crawl Control product in 2025 [3] aimed specifically at this problem. It performs behavioural detection on suspected scrapers regardless of declared user-agent. Other edge providers have followed.
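The first two measures above can be sketched in a few lines of Python. This is a minimal illustration of the logic, assuming a sliding-window counter per IP address. The class name, function name, and thresholds are hypothetical. On a real Nginx stack this job would more typically be done by the built-in limit_req module than by application code.

```python
import math
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> recent request times

    def allow(self, client_ip: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def handle_request(limiter: SlidingWindowLimiter, client_ip: str,
                   now: Optional[float] = None):
    """Return (status, headers). The user-agent is never consulted."""
    if limiter.allow(client_ip, now):
        return 200, {}
    # 429 Too Many Requests, with a hint about when to try again.
    retry_after = str(math.ceil(limiter.window_seconds))
    return 429, {"Retry-After": retry_after}
```

With a limit of, say, 3 requests per second, the fourth request inside the same second gets a 429 and a Retry-After header, whatever name the client gave. Once the window has passed, the same client is served again. A human visitor never reaches the threshold, so never sees the block.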

The common pattern: none of this depends on the crawler being honest about who it is. The approach assumes the opposite and responds to behaviour. This is the standard you should be measuring your hosting provider against. Not "do you support robots.txt." Of course they do. The right question is whether your stack rate-limits aggressive crawlers automatically and whether there is a real operations team watching for the behaviour you cannot see. That broader picture connects to every other performance decision a business owner needs to make about their site.

Web60's managed WordPress hosting handles this at the Nginx layer, with rate limiting and fail2ban intrusion prevention applied per site. Not because it is a marketing feature. Because the alternative is paying for the bandwidth those scrapers consume. The whole platform runs on sovereign Irish infrastructure with the kind of edge controls that turn the AI crawler problem into a non-event for the site owner.

The Honest Limitation

Server-level blocking is not perfect. A sufficiently determined scraper running residential proxies and pacing requests carefully will get through. The goal is not zero scraping. The goal is keeping the load manageable and the obvious offenders out.

Across the managed WordPress sites we monitor, behaviour-based rate limiting catches what looks like the majority of aggressive scraping traffic, though the exact figure varies by site and by month. The remainder is a trickle most owners would never notice. That is the trade-off worth making. The alternative, leaving robots.txt as the only line of defence, catches the polite minority and waves through everyone else.

Where Robots.txt Still Makes Sense

If your site is a blog with significant organic traffic and your goal is to participate in AI search referrals rather than block them, robots.txt is genuinely the right tool. You list what is open, you list what is closed, the cooperative bots respect it, and you accept that uncooperative ones will fetch what they want anyway. For a content publisher chasing AI-search visibility, that trade-off is rational. The benefit can outweigh the cost.

For a service business, a local retailer, or a professional services firm whose website exists to attract paying customers and not to train language models, the calculation does not apply. The cost is real. The benefit is zero. The defence should look entirely different.

The Question Worth Asking Your Hosting Provider

Open a support ticket with your current hosting provider today and ask one thing. What does your platform do automatically when a single source starts fetching hundreds of pages per minute from my site?

If the answer is "we recommend updating your robots.txt file," you have an honour-system defence on infrastructure the dishonest bots have already learned to ignore. If the answer is some version of "we rate-limit at the server level and our operations team monitors for it," you have an actual defence.

The robots.txt myth is a comforting one. It costs nothing to believe. The bandwidth charges, the slower pages, and the resource consumption it lets through are real. Knowing which question to ask is most of the work.

Sources

Ian O'Reilly, Operations Director, Web60

Ian oversees Web60's hosting infrastructure and operations. Responsible for the uptime, security, and performance of every site on the platform, he writes about the operational reality of keeping Irish business websites fast, secure, and online around the clock.
