Thursday, April 23, 2026


Most Major News Publishers Block AI Training & Retrieval Bots

Introduction to AI Training Bots and News Publishers

Most top news publishers block AI training bots via robots.txt, but they’re also blocking the retrieval bots that determine whether sites appear in AI-generated answers. A study by BuzzStream analyzed the robots.txt files of 100 top news sites across the US and UK and found that 79% block at least one training bot. More notably, 71% also block at least one retrieval or live search bot.

What The Data Shows

The study examined the top 50 news sites in each market by SimilarWeb traffic share, then deduplicated the list. Bots were grouped into three categories: training, retrieval/live search, and indexing. Among training bots, Common Crawl’s CCBot was the most frequently blocked at 75%, followed by anthropic-ai at 72%, ClaudeBot at 69%, and GPTBot at 62%. Google-Extended, the token that opts sites out of Gemini training, was the least blocked training bot at 46% overall.

Training Bot Blocks

US publishers blocked Google-Extended at 58%, nearly double the 29% rate among UK publishers. According to Harry Clarkson-Bennett, SEO Director at The Telegraph, “Publishers are blocking AI bots using the robots.txt because there’s almost no value exchange. LLMs are not designed to send referral traffic and publishers (still!) need traffic to survive.”
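The blocks described above are plain robots.txt directives. The following illustrative file (not taken from any specific publisher) shows how a site would disallow the four training bots named in the study while leaving all other crawlers unaffected:

```
# Illustrative robots.txt: block the training bots named in the study.
User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /

# All other crawlers remain unrestricted.
User-agent: *
Allow: /
```

Each `User-agent` group applies only to crawlers whose user-agent string matches that token, which is why a site can block training bots without touching Googlebot or Bingbot.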


Retrieval Bot Blocks

The study found 71% of sites block at least one retrieval or live search bot. Claude-Web was blocked by 66% of sites, while OpenAI’s OAI-SearchBot, which powers ChatGPT’s live search, was blocked by 49%. ChatGPT-User was blocked by 40%. Perplexity-User, which handles user-initiated retrieval requests, was the least blocked at 17%.

Indexing Blocks

PerplexityBot, which Perplexity uses to index pages for its search corpus, was blocked by 67% of sites. Only 14% of sites blocked all AI bots tracked in the study, while 18% blocked none.

The Enforcement Gap

The study acknowledges that robots.txt is a directive, not a barrier, and bots can ignore it. We covered this enforcement gap when Google’s Gary Illyes confirmed robots.txt can’t prevent unauthorized access. It functions more like a “please keep out” sign than a locked door. Clarkson-Bennett raised the same point in BuzzStream’s report: “The robots.txt file is a directive. It’s like a sign that says please keep out, but doesn’t stop a disobedient or maliciously wired robot. Lots of them flagrantly ignore these directives.”
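The "please keep out" framing can be made concrete: a compliant crawler parses robots.txt and checks permission before each fetch, but nothing in the protocol stops a non-compliant one. This minimal Python sketch, using the standard library's `urllib.robotparser` against a hypothetical robots.txt, shows what the honor-system check looks like from the crawler's side:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a news site that blocks training bots
# but leaves retrieval/live-search bots alone.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler calls can_fetch() before requesting a page;
# a disobedient one simply skips this step and fetches anyway.
print(parser.can_fetch("GPTBot", "https://example.com/article"))         # False
print(parser.can_fetch("OAI-SearchBot", "https://example.com/article"))  # True
```

The `can_fetch()` result is advisory: the server never sees it, which is exactly the enforcement gap Illyes and Clarkson-Bennett describe.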

Why This Matters

The retrieval-blocking numbers warrant attention here. In addition to opting out of AI training, many publishers are opting out of the citation and discovery layer that AI search tools use to surface sources. OpenAI separates its crawlers by function: GPTBot gathers training data, while OAI-SearchBot powers live search in ChatGPT. Blocking one doesn’t block the other. Perplexity makes a similar distinction between PerplexityBot for indexing and Perplexity-User for retrieval.
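Because the crawlers are separated by function, a publisher can express that distinction directly in robots.txt. This illustrative fragment opts out of OpenAI training while staying eligible for ChatGPT live-search citations:

```
# Illustrative: block training data collection, allow live search.
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
```

A site that blanket-blocks every agent containing "GPT" or "OpenAI" loses the retrieval layer along with the training opt-out, which is the trade-off the study's 49% OAI-SearchBot block rate suggests many publishers have made, deliberately or not.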

Looking Ahead

Because robots.txt is only advisory, sites that want to block AI crawlers may find server- or CDN-level restrictions more effective than robots.txt alone. Cloudflare’s Year in Review found GPTBot, ClaudeBot, and CCBot had the highest number of full disallow directives across top domains. The report also noted that most publishers use partial blocks rather than full blocks for Googlebot and Bingbot, reflecting the dual role Google’s crawler plays in search indexing and AI training.
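A server-level block, unlike robots.txt, is actually enforced. This nginx sketch (an assumption for illustration, not a configuration from the article or from Cloudflare) returns 403 to selected AI crawlers based on their User-Agent header:

```nginx
# Sketch: enforce the block at the server rather than requesting it
# via robots.txt. Bot names are examples; note that User-Agent strings
# can be spoofed, so this is stronger than robots.txt but not airtight.
map $http_user_agent $blocked_ai_bot {
    default      0;
    ~*GPTBot     1;
    ~*ClaudeBot  1;
    ~*CCBot      1;
}

server {
    listen 80;
    server_name example.com;

    if ($blocked_ai_bot) {
        return 403;
    }

    # ... normal site configuration continues here ...
}
```

CDN rules such as Cloudflare’s bot-management features work on the same principle, with the added ability to verify crawlers by IP range rather than trusting the User-Agent header.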

Conclusion

For those tracking AI visibility, the retrieval bot category is the one to watch. Training blocks affect future models, while retrieval blocks affect whether content shows up in AI answers right now. A publisher that blocks OAI-SearchBot or Perplexity-User is opting out of current AI search citations, not just future training runs. Understanding which bot performs which function is the prerequisite for a robots.txt policy that protects content without also sacrificing discovery.
